Deep learning network for evolutionary conservation

ABSTRACT

The technology disclosed relates to a deep learning network system for evolutionary conservation prediction. In one implementation, the system includes a first model for processing a first multiple sequence alignment that aligns a query sequence with a masked base at a target position to N non-query sequences and predicting a first identity of the masked base at the target position. The system also includes a second model for processing a second multiple sequence alignment that aligns the query sequence to M non-query sequences, where M>N, and predicting a second identity of the masked base at the target position. The system further includes an evolutionary conservation determination logic configured to measure an evolutionary conservation of the masked base at the target position based on the first and second identities of the masked base.

PRIORITY APPLICATIONS

This application claims the benefit of and priority to the following:

U.S. Provisional Patent Application No. 63/294,813, titled “PERIODIC MASK PATTERN FOR REVELATION LANGUAGE MODELS,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1063-1/IP-2296-PRV);

U.S. Provisional Patent Application No. 63/294,816, titled “CLASSIFYING MILLIONS OF VARIANTS OF UNCERTAIN SIGNIFICANCE USING PRIMATE SEQUENCING AND DEEP LEARNING,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1064-1/IP-2297-PRV);

U.S. Provisional Patent Application No. 63/294,820, titled “IDENTIFYING GENES WITH DIFFERENTIAL SELECTIVE CONSTRAINT BETWEEN HUMANS AND NON-HUMAN PRIMATES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1065-1/IP-2298-PRV);

U.S. Provisional Patent Application No. 63/294,827, titled “DEEP LEARNING NETWORK FOR EVOLUTIONARY CONSERVATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1066-1/IP-2299-PRV);

U.S. Provisional Patent Application No. 63/294,828, titled “INTER-MODEL PREDICTION SCORE RECALIBRATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1067-1/IP-2301-PRV); and

U.S. Provisional Patent Application No. 63/294,830, titled “SPECIES-DIFFERENTIABLE EVOLUTIONARY PROFILES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1068-1/IP-2302-PRV).

The priority applications are incorporated by reference as if fully set forth herein.

FIELD OF THE TECHNOLOGY DISCLOSED

The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using deep convolutional neural networks to analyze ordered data.

INCORPORATIONS

The following are incorporated by reference for all purposes as if fully set forth herein, and should be considered part of this nonprovisional patent filing:

- Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018);
- Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019);
- U.S. Patent Application No. 62/573,144, titled “TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV);
- U.S. Patent Application No. 62/573,149, titled “PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV);
- U.S. Patent Application No. 62/573,153, titled “DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV);
- U.S. Patent Application No. 62/582,898, titled “PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Nov. 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV);
- U.S. patent application Ser. No. 16/160,903, titled “DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US);
- U.S. patent application Ser. No. 16/160,986, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US);
- U.S. patent application Ser. No. 16/160,968, titled “SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US);
- U.S. patent application Ser. No. 16/160,978, titled “DEEP LEARNING-BASED SPLICE SITE CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1001-4/IP-1680-US);
- U.S. patent application Ser. No. 16/407,149, titled “DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US);
- U.S. patent application Ser. No. 17/232,056, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS TO PREDICT VARIANT PATHOGENICITY USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURES,” filed on Apr. 15, 2021 (Atty. Docket No. ILLM 1037-2/IP-2051-US);
- U.S. Patent Application No. 63/175,495, titled “MULTI-CHANNEL PROTEIN VOXELIZATION TO PREDICT VARIANT PATHOGENICITY USING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Apr. 15, 2021 (Atty. Docket No. ILLM 1047-1/IP-2142-PRV);
- U.S. Patent Application No. 63/175,767, titled “EFFICIENT VOXELIZATION FOR DEEP LEARNING,” filed on Apr. 16, 2021 (Atty. Docket No. ILLM 1048-1/IP-2143-PRV);
- U.S. patent application Ser. No. 17/468,411, titled “ARTIFICIAL INTELLIGENCE-BASED ANALYSIS OF PROTEIN THREE-DIMENSIONAL (3D) STRUCTURES,” filed on Sep. 7, 2021 (Atty. Docket No. ILLM 1037-3/IP-2051A-US);
- U.S. Provisional Patent Application No. 63/253,122, titled “PROTEIN STRUCTURE-BASED PROTEIN LANGUAGE MODELS,” filed Oct. 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV);
- U.S. Provisional Patent Application No. 63/281,579, titled “PREDICTING VARIANT PATHOGENICITY FROM EVOLUTIONARY CONSERVATION USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURE VOXELS,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV); and
- U.S. Provisional Patent Application No. 63/281,592, titled “COMBINED AND TRANSFER LEARNING OF A VARIANT PATHOGENICITY PREDICTOR USING GAPED AND NON-GAPED PROTEIN SAMPLES,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV).

BACKGROUND

Mutations occur spontaneously in each generation. Harmful mutations are lost from the gene pool because the individuals carrying them reproduce less effectively. Harmless (or, very rarely, beneficial) mutations are kept in the gene pool, producing variability. Genes that are functionally critical are conserved in the gene pool. The degree of evolutionary conservation of a gene reflects a balance between its natural tendency to mutate and the overall need to retain its structural integrity and essential functions. The presence of similarly conserved genes, portions of genes, or chromosome segments in different species reflects both the common origin of the species and the important functional properties of the conserved genomic elements. Furthermore, in directional selection, a phenotype is favored over other phenotypes, causing an increase in the frequency of alleles that provide fitness advantages and their conservation through speciation events (i.e., lineage-splitting events that produce two or more separate species) and genome duplications. Consequently, conservation analysis relying on sequence similarities among different species has been exploited.

Novel functional genomic elements in the primate lineage are prime candidates for understanding the changes that have contributed to the uniqueness of our own species. Although comparisons of the human genome with those of other mammalian and vertebrate species have revealed an extensive catalog of conserved genes and regulatory elements, the conserved sequence elements that are specific to primates have been challenging to identify due to the short evolutionary distances between primate species. At these short evolutionary timescales, it is unclear whether the absence of changes between species is due to evolutionary conservation, or simply because there has not been an opportunity for random mutations to arise. Consequently, the genomic adaptations along the phylogenetic branch from which the human species ultimately emerged remain largely undiscovered.

Compared to the mammalian lineage, which includes over 6000 species separated by ˜200 million years of evolution, the primate lineage comprises 521 species that are separated by a fraction of this time. Thus, despite 43 primate species having been sequenced and aligned to date in the recent Zoonomia study of 241 placental mammals, the total phylogenetic branch length within these primates is only 10% that of the placental mammal alignment. These challenges make it clear that a successful strategy to identify sequence elements that are exclusively conserved in primates will require improved algorithms for detecting conservation at short evolutionary distances, as well as sequencing data from a far greater number of primate species.

Given that sequencing data is multi- and high-dimensional, deep neural networks hold great promise for analyzing sequencing data because of their broad applicability and enhanced prediction power. Different types of neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery, pathogenic variant identification, and gene expression inference. More importantly, an opportunity arises to use deep learning models to detect conservation at short evolutionary timescales.

BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

The color drawings also may be available in PAIR via the Supplemental Content tab. In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:

FIG. 1 illustrates an example system of detecting evolutionary conservation using two deep learning network models, in accordance with one implementation of the technology disclosed.

FIG. 2 illustrates an example of multiple sequence alignment, in accordance with one implementation of the technology disclosed.

FIG. 3 is an example ideogram of the human genome depicting the average number of species covering each position in the multiple sequence alignment, in accordance with one implementation of the technology disclosed.

FIG. 4A illustrates a genome-wide distribution of primate assembly coverage for each 100 kb segment of human genome, in accordance with one implementation of the technology disclosed.

FIG. 4B illustrates per base mismatch rate between newly generated short read contigs and species with previously existing reference assemblies, in accordance with one implementation of the technology disclosed.

FIG. 5 is a schematic representation of an encoder-decoder architecture, in accordance with one implementation of the technology disclosed.

FIG. 6 shows an overview of an attention mechanism added onto an RNN encoder-decoder architecture, in accordance with one implementation of the technology disclosed.

FIG. 7 is a schematic representation of the calculation of self-attention showing one attention head, in accordance with one implementation of the technology disclosed.

FIG. 8 illustrates several attention heads in a Transformer block, in accordance with one implementation of the technology disclosed.

FIG. 9 illustrates parallel execution of multi-head attention logics, in accordance with one implementation of the technology disclosed.

FIG. 10 illustrates one encoder layer of a Transformer network, in accordance with one implementation of the technology disclosed.

FIG. 11 illustrates a schematic overview of a Transformer model, in accordance with one implementation of the technology disclosed.

FIG. 12A shows a Vision Transformer (ViT), in accordance with one implementation of the technology disclosed.

FIG. 12B shows a Transformer block used by the Vision Transformer, in accordance with one implementation of the technology disclosed.

FIGS. 13A, 13B, 13C, and 13D show details of the Transformer block of FIG. 12B, in accordance with one implementation of the technology disclosed.

FIG. 14 shows an example source code implementing the Vision Transformer, in accordance with one implementation of the technology disclosed.

FIG. 15 illustrates an example of the first model predicting an identity of a masked base on the query sequence aligned with non-query sequences, in accordance with one implementation of the technology disclosed.

FIG. 16 illustrates an example architecture of the first model with layer dimensions, in accordance with one implementation of the technology disclosed.

FIG. 17 illustrates an example of the second model predicting an identity of a masked base on the query sequence aligned with non-query sequences, in accordance with one implementation of the technology disclosed.

FIG. 18 illustrates an example architecture of the second model with layer dimensions, in accordance with one implementation of the technology disclosed.

FIG. 19 illustrates training and testing curves for the first model (e.g., neutral substitution rate model) and the second model (e.g., conservation-adjusted substitution rate model), in accordance with one implementation of the technology disclosed.

FIG. 20 is a table showing the training and testing accuracy in predicting the masked nucleotide using the first model (e.g., neutral substitution rate model) and the second model (e.g., conservation-adjusted substitution rate model), respectively, in accordance with one implementation of the technology disclosed.

FIG. 21 illustrates comparison results of mean conservation scores at protein-coding exons with a constant 130 nucleotide (nt) size and the same codon reading frame (red), as well as 30 nt of flanking intronic sequence on either side (blue), predicted by an example system in accordance with one implementation of the technology disclosed and PhyloP method, respectively.

FIGS. 22A and 22B illustrate the distribution of conservation scores in protein-coding exons at codon position 2 (blue) vs codon position 3 (orange) using an example system in accordance with one implementation of the technology disclosed and PhyloP method, respectively.

FIG. 23 illustrates the enrichment of highly conserved nucleotides (>95th percentile of scores) falling in codon position 2 relative to codon position 3, where the highly conserved nucleotides are identified by an example system in accordance with one implementation of the technology disclosed and PhyloP method, respectively.

FIG. 24 is an example computer system that can be used to implement the technology disclosed.

DETAILED DESCRIPTION

The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general-purpose signal processor or a block of random-access memory, hard disk, or the like) or multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may function in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.

The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.

Rule-based logic as used herein (e.g., evolutionary conservation determination logic) can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps described herein. The “logic” can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. The rule-based evolutionary conservation determination logic can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media). In one implementation, the logic implements a data processing function. The logic can be a general purpose, single core or multicore, processor with a computer program specifying the function, a digital signal processor with a computer program, configurable logic such as an FPGA with a configuration file, a special purpose circuit such as a state machine, or any combination of these. Also, a computer program product can embody the computer program and configuration file portions of the logic.

System of Deep Learning Networks for Evolutionary Conservation Determination

FIG. 1 illustrates an example system 100 for detecting evolutionary conservation using two deep learning network models, in accordance with one implementation of the technology disclosed. The first input 102 includes a multiple sequence alignment (MSA) that aligns a query sequence to a plurality of non-query sequences. The query sequence includes a masked base at a target position that is flanked by right and left bases. For example, the query sequence can be a 5 nucleotide (nt) sequence, where the masked base represented by “?” is flanked by two right and two left bases. Four 5 nt non-query sequences are aligned with the query sequence in the MSA. Similarly, the second input 112 can include an MSA that aligns a query sequence to a plurality of non-query sequences. The number of non-query sequences in the second input 112 can be larger than the number of non-query sequences in the first input 102.
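
As a concrete illustration of how the toy 5 nt alignment above can be turned into a machine-processable input, the sketch below one-hot encodes an MSA in which “?” marks the masked base at the target position. The vocabulary, mask token, and tensor layout are illustrative assumptions rather than the encoding mandated by this disclosure.

```python
import numpy as np

# Illustrative vocabulary: four bases, an unknown base (X), a gap (-), and a mask token (?).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "X": 4, "-": 5, "?": 6}

def encode_msa(msa_rows):
    """One-hot encode an MSA given as equal-length strings; '?' marks the masked base."""
    depth = len(VOCAB)
    encoded = np.zeros((len(msa_rows), len(msa_rows[0]), depth), dtype=np.float32)
    for i, row in enumerate(msa_rows):
        for j, base in enumerate(row.upper()):
            encoded[i, j, VOCAB.get(base, VOCAB["X"])] = 1.0
    return encoded

# First input 102: a human query with the masked center base plus four non-query rows.
first_input = encode_msa(["AC?GT", "ACAGT", "ACGGT", "AC-GT", "ACAGA"])
print(first_input.shape)  # (5, 5, 7): sequences x positions x channels
```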

As an example, the selected query sequence and non-query sequences can indicate ancestral alleles (e.g., sequences found in the last common ancestor of two species). Knowing which allele is ancestral is important for understanding genome evolution. The ancestral allele information is crucial for increasing accuracy when estimating allele ages and can provide a better understanding of genomic signatures due to selection pressures. In addition, knowing the ancestral alleles of variants can provide specific explanations regarding the formation of linkage disequilibrium patterns in the genome. Ancestral allele information is also potentially helpful for understanding the rise and extinction of disease-causing variants and disease etiology.

It should also be noted that the technology disclosed does not intend to limit the number of non-query sequences in the MSA as the first or second input, which can be 3, 5, 10, 50, 100, 150, 200, 250, 300 and so on. Further, the technology disclosed does not intend to limit the length of query sequences and non-query sequences, which can be 1-10 nt, 11-50 nt, 51-100 nt, 101-200 nt, 201-500 nt and so on.

The first model 104 processes the first input 102 and predicts an identity of the masked base 106. In some implementations, the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being adenine (A), cytosine (C), thymine (T), and guanine (G). In other implementations, the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). In some implementations, the probability distribution is a softmax distribution.
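
In practice, such a softmax distribution is obtained by normalizing the model's output logits for the masked position. The sketch below shows the conversion for a hypothetical six-class output over {A, C, T, G, X, -}; the logit values are made up for illustration.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return z / z.sum()

classes = ["A", "C", "T", "G", "X", "-"]
logits = np.array([2.1, 0.3, -0.5, 1.2, -2.0, -3.0])  # hypothetical model outputs
probs = softmax(logits)
print(dict(zip(classes, probs.round(3))))  # base-wise likelihoods that sum to 1
```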

In one implementation, the first model 104 is a background mutation estimation model (also called a neutral substitution rate model) configured to encode a background mutation probability estimation in the identity of the masked base at the target position. Background mutation probability estimation measures the random mutation experienced in genes. The first model 104 can estimate a background mutation probability by predicting the identity of a masked base at a target position of a query sequence from the human genome aligned with genomes from other species that are closely related to humans.

The second model 114 processes the second input 112 and predicts an identity of the masked base 116. In some implementations, the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, and G. In other implementations, the predicted identity is a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). In some implementations, the probability distribution is a softmax distribution.

In one implementation, the second model 114 is a group-specific mutation probability estimation model configured to encode a group-specific mutation probability estimation in the identity of the masked base at the target position. In one implementation, the second model 114 can measure epistasis-driven mutations experienced in genes of a particular group of species. Epistasis occurs when the combined effect of two or more mutations differs from the sum of their individual effects. In other words, the effect of a mutation depends on the genetic context in which it appears. Such dependencies on genetic background or context can cause phenotypic diversity within individuals and, on evolutionary timescales, incompatibilities between different species. For example, some mutations that cause diseases in humans are fixed in the genomes of other species without any apparent deleterious consequences. Hence, the measurement of epistasis-driven mutations in a particular group of species allows the prediction of evolutionary outcomes, which will have profound implications for disease diagnosis and treatment. As an example, sickle-cell anemia is a multigenic disease at the phenotypic level. This is clinically evident because, out of fifteen potential complications, each patient exhibits three to five of them, with significant inter-individual variation in intensity. It is caused by three types of genes: the gene housing the primary mutation; pleiotropic genes (genes involved in secondary pathophysiologic events beyond the primary mutation); and epistatic genes (polymorphisms in pleiotropic genes) that significantly modulate the pathophysiology of the disease in a particular patient. The modulation by epistatic genes is an important part of the phenotype and explains the intense inter-individual differences in the severity of the disease, in spite of all the patients having the same primary mutation. R. L. Nagel, C. R. Biologies 328, 606-615 (2005). The identification of epistasis-driven mutations and the evaluation of their potential effects can inform the diagnosis of diseases, in light of the complexity of patient-specific manifestations and disease-related phenotypes, and can potentially inform patient-specific treatment.

In some implementations, the first model 104 and the second model 114 are neural networks, for example, feed-forward neural networks that are trained using loss functions via backpropagation. Without limiting the scope of the technology disclosed, the first model 104 and the second model 114 can be convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, or time series data processing models using time series data as input (e.g., sequencing-by-synthesis (SBS) sequencing data). More example architectures of the first model 104 and the second model 114 will be described in detail below in accordance with FIGS. 5-14.

The first model 104 and the second model 114 can estimate evolutionary conservation-related information that is implicitly encoded in the MSA as second-order signals. The term second-order signals represents information that is implicitly encoded and can be derived from the MSA via the processing of the first model 104 and the second model 114. Evolutionary conservation-related information can include evolutionary distances and the lengths of the branches in phylogenetic trees. Evolutionary distances refer to the number of nucleotide substitutions per site between two homologous DNA sequences or the number of amino acid substitutions per site between two homologous protein sequences. Estimation of evolutionary distances between sequences is important for constructing phylogenetic trees, dating species' divergences, and understanding the mechanisms of evolution of genes, proteins, and populations. Furthermore, branch length indicates genetic change. The longer the branch, the more genetic change (or divergence) has occurred during evolution. One way of measuring the extent of genetic change is to estimate the average number of nucleotide or amino acid substitutions per site. Thus, by measuring the frequency of the substitution of bases at particular positions across the MSA, the first model 104 and the second model 114 can estimate evolutionary distances and branch lengths that are implicitly encoded across the MSA.

The first model 104 and the second model 114 can predict alternative splicing at the masked base in response to processing the MSA that aligns sequences of a given length (e.g., >5 nt). Alternative splicing is a process of selecting different combinations of splice sites within a messenger RNA precursor (pre-mRNA) to produce variably spliced mRNAs. For example, the first model 104 and the second model 114 can be used to identify protein-coding genes that are conserved only in primates and frequently alternatively spliced. These conservations that are exclusive to primates and the frequent alternative splicing suggest that, as novel exons emerge, they are initially utilized in a limited capacity while exploring sequence space for more functions.

The first model 104 and the second model 114 can be configured to detect conservation-driving genomic features. Noncoding sequence, which comprises nearly 99% of the human genome, is highly active in gene regulation, and most genetic variants associated with complex human diseases map to noncoding rather than protein-coding sequence, particularly to cis-regulatory elements such as active enhancers and DNase-I hypersensitivity sites. The first model 104 and the second model 114 can identify hundreds of thousands of noncoding regulatory elements at short branch lengths that are conserved only in primates and not across placental mammals. Considering that coding sequences comprise only 1% of the human genome, the emergence of humans and primates from other placental mammals can be largely driven by novel regulatory sequence elements rather than innovations in protein-coding sequences. The first model 104 and the second model 114 can also identify human DNase-I hypersensitivity sites that are conserved exclusively in primates at short branch lengths. Existing methods (e.g., the PhyloP method for detecting nonneutral substitution rates on mammalian phylogenies) rely on long branch lengths and thus are incapable of catching these conservations. The first model 104 and the second model 114 add newly identified noncoding regulatory elements, suggesting that a significant portion of elements previously believed to lack conservation may represent recent evolutionary adaptations, with sequence conservation detectable in primates.

The first model 104 and the second model 114 can be configured to determine a function or feature of the masked base in dependence upon the evolutionary conservation. Hundreds of thousands of genetic variants have been determined via genome-wide association studies to be associated with human diseases, the majority of which map to noncoding cis-regulatory elements (e.g., active enhancers and DNase-I hypersensitivity sites). The first model 104 and the second model 114 can predict the identities of the masked base as regulatory elements. These identified disease-associated regulatory elements and their corresponding conservations can be used to analyze whether evolutionary sequence adaptations that occurred even closer to the origin of the human species would yield additional conserved cis-regulatory elements associated with functions. For example, the primate species can be subdivided into subsets of monkey and ape species that diverged about 37 million years ago, less than half the divergence time of the overall primate lineage. The first model 104 and the second model 114 can identify additional noncoding regulatory elements that are conserved exclusively in monkey and ape species but not across all primates or mammals. These elements identified by the two models, based on their corresponding evolutionary conservations, suggest that a substantial fraction of disease-associated genetic variation is attributable to functionally conserved, disease-associated regulatory elements with recent evolutionary origins.

As further illustrated in FIG. 1, the system 100 includes an evolutionary conservation determination logic 108 that measures evolutionary conservation of the masked base 110, based on the identities of the masked base at the target position generated from the first and second models, respectively. The evolutionary conservation can be measured as a frequency of occurrence of a particular base at a particular position across the MSA. Alternatively, the evolutionary conservation can be measured as a frequency of substitution of a particular base at a particular position across the MSA.

The evolutionary conservation determination logic 108 can measure the evolutionary conservation of the masked base 110 by comparing the two identities (106 and 116) of the masked base predicted from the first and second models. For example, when the identities of the masked base are represented by probability distributions that specify base-wise likelihoods of the masked base, the evolutionary conservation determination logic 108 defines a metric from the probability outputs of the two models, where the metric is representative of an evolutionary conservation score of the masked base. In other implementations, the evolutionary conservation determination logic 108 compares the two identities of the masked base predicted from the first and second models and measures a difference thereof, using, for example, a t-test, KL divergence, or average loss. The measured difference can represent the evolutionary conservation of the masked base.
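
As one hedged example of such a comparison metric, the sketch below computes the KL divergence between the two predicted base-wise distributions; the distributions shown are hypothetical, and KL divergence is only one of the comparison options mentioned above.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two base-wise probability distributions."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical outputs over {A, C, T, G} for the first identity 106 and the second identity 116.
first_identity = [0.40, 0.25, 0.20, 0.15]
second_identity = [0.85, 0.05, 0.05, 0.05]
print(kl_divergence(second_identity, first_identity))  # a larger value indicates a larger difference
```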

In one implementation, the first model is a neutral substitution rate model, and the first identity of the masked base 106 represents a neutral substitution rate of the masked base. The second model is a conservation-adjusted substitution rate model, and the second identity represents a conservation-adjusted substitution rate of the base. To estimate conservation at the individual nucleotide level, the conservation-adjusted substitution rate is compared with the neutral substitution rate. For each individual nucleotide x, the predicted probabilities of the ancestral allele from both the neutral substitution rate model (n_x) and the conservation-adjusted substitution rate model (c_x) are extracted. In one embodiment, the first model takes as input a query sequence from human and four non-query sequences from four species that are closest to human in terms of evolutionary distance, whereas the second model takes as input the query sequence and non-query sequences from 236 primate species. When a nucleotide is highly conserved, the second model (e.g., conservation-adjusted substitution rate model) predicts that it is less likely for the nucleotide to change from its ancestral state due to evolutionary constraint, and thus c_x can be significantly higher than n_x.

In one implementation, the evolutionary conservation determination logic 108 incorporates an ensemble approach to compute the significance of the difference between c_x and n_x. Consider an MSA including M sequences from M respective species, each sequence corresponding to one species. Both the neutral substitution rate model and the conservation-adjusted substitution rate model are trained centered on each of the M species. In particular, the corresponding sequence from each species can be a query sequence, with a center nucleotide on the query sequence being masked.

For the neutral substitution rate model, when the sequence of a given species among the M species works as a query sequence, sequences from four species that are closest to the given species are selected as non-query sequences. For the conservation-adjusted substitution rate model, when the sequence of a given species among the M species works as a query sequence, all other M−1 sequences are non-query sequences.

From these M pairs of trained models, for each species i, the evolutionary conservation determination logic 108 can extract the predicted probabilities of the ancestral allele, n_x^i and c_x^i. Extracting the ensemble probabilities for the neutral substitution rate model and the conservation-adjusted substitution rate model allows the computation of the significance of the difference between the predicted probabilities of the ancestral allele. The difference can represent the evolutionary conservation 110 of the masked base.

The t-test statistics between N_x = {n_x^1, n_x^2, . . . , n_x^M} and C_x = {c_x^1, c_x^2, . . . , c_x^M}, assuming equal variance, can be computed by the evolutionary conservation determination logic 108. When p_x is the p-value of the t-test statistic at base x, a conservation score S_x of each individual nucleotide x can also be calculated, where S_x = −log(p_x). It should be noted that, since alignments for all species are not always available at each position of the human genome, only the predicted probabilities from species with alignments present at the position of interest are considered when computing the t-test statistics.
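
A minimal sketch of this scoring step is shown below, assuming the per-species ancestral-allele probabilities have already been extracted from the M model pairs; species without an alignment at the position are marked with NaN and excluded, and SciPy's two-sample t-test is used as one way to obtain p_x.

```python
import numpy as np
from scipy import stats

def conservation_score(n_x, c_x):
    """S_x = -log(p_x), where p_x is the p-value of an equal-variance t-test between
    the neutral (n_x) and conservation-adjusted (c_x) per-species probabilities."""
    n_x, c_x = np.asarray(n_x, dtype=float), np.asarray(c_x, dtype=float)
    present = ~np.isnan(n_x) & ~np.isnan(c_x)  # species aligned at the position of interest
    _, p_x = stats.ttest_ind(c_x[present], n_x[present], equal_var=True)
    return -np.log(p_x)

# Hypothetical ensemble probabilities of the ancestral allele across aligned species.
n = [0.45, 0.50, 0.40, 0.55, np.nan, 0.48]
c = [0.90, 0.88, 0.93, 0.91, np.nan, 0.89]
print(conservation_score(n, c))
```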

We now turn to the advantages of the technology disclosed by using two models to predict evolutionary conservation at the resolution of single nucleotides in the genome. By leveraging an MSA from a group of species, the technology disclosed can identify a nucleotide or a region of nucleotides in the genome that has been conserved across evolution, even when the group of species has short evolutionary distances. The capability of identifying conserved regions of nucleotides without being limited by the short evolutionary distances among species, as described in various embodiments below, can be particularly desirable. It allows the identification and conservation analysis of regions of nucleotides, for example, non-coding regulatory elements, protein-coding exons, and epistasis-driven mutations among closely related species, when these regions were previously missed by existing methods. These newly identified genomic elements in the primate lineage are the prime candidates for understanding the changes that have contributed to the uniqueness of our own species. The use of an MSA including 236 primate species substantially expands the available sequence data representing evolutionary conservation at short evolutionary distances. The use of two models with different MSA inputs improves the sensitivity and accuracy of modeling in identifying regions of conserved nucleotides and their corresponding conservation. The technology disclosed brings an improved interpretation of the genome and the importance of its genomic features. Various embodiments of the technology disclosed can be used to identify evolutionary features crucial to a group of closely related species, for example, microRNAs that are specific to primates, alternatively spliced exons conserved in primates but not in other branches, primate-conserved noncoding cis-regulatory elements, and epistasis-driven mutations experienced in genes of a particular group of species.

Unlike existing models that are highly sensitive to the distribution of branch lengths in the phylogenetic tree and thus only work well when measuring evolutionary conservation among groups of species with large evolutionary distances, various embodiments of the technology disclosed do not require explicit tree structures or branch lengths. In other words, the technology relaxes the need to explicitly depend on branch lengths. The technology disclosed identifies nucleotide positions that have remained conserved during evolution and detects evolutionary constraints in the lineage of interest at increased resolution. Hence, it allows the investigation of the genomics of shared and specialized traits in groups of species.

Input to First Model and Second Model

In one implementation, the total number of sequences in the first input 102 is less than 10 (e.g., 5). The total number of sequences in the second input 112 is more than the total number in the first input 102, for example, 20, 50, 100, 150, 200, 210, 220, 230, 240 and so on. The non-query sequences in the second input 112 can include part or all of the non-query sequences in the first input 102. In another implementation, all non-query sequences in the first and/or second input are closest homologues to their respective query sequences.

The query sequence in the first input 102 and/or the second input 112 can belong to one species of interest, for example, human. The non-query sequences in the first input 102 and/or the second input 112 can belong to non-human primates. In one implementation, the non-query sequences in the first input 102 can belong to a group of species that shares a family (e.g., hominids) with the species of interest. In another implementation, the group of species shares an order (e.g., primates) with the species of interest. The non-query sequences in the second input 112 can belong to another group of species that shares a class (e.g., mammals) with the species of interest. In one implementation, the non-query sequences in the second input 112 can share a phylum (e.g., chordates) with the species of interest. In another implementation, the non-query sequences in the second input 112 can share a kingdom (e.g., animals) with the species of interest.

In some implementations, an average evolutionary distance determined from evolutionary distances between species within the group of species for non-query sequences in the first input 102 is less than an average evolutionary distance determined from evolutionary distances between species within the group of species for non-query sequences in the second input 112. Evolutionary distances can refer to the number of substitutions per site separating a pair of homologous sequences since the sequences diverged from their common ancestral sequence, and can be computed by aligning the pair of homologous sequences. In other implementations, the two groups of species for non-query sequences in the first and second input overlap. That is, some species are present in both groups. The non-overlapping species can be more evolutionarily distant from the species of interest than the overlapping species. Including evolutionarily distant species as input to the second model but not the first model, and comparing the predicted identities of the masked base from the two models, can reveal evolutionary conservation information of the masked base, taking into consideration both species that are closest to the species of interest and evolutionarily distant species with larger evolutionary distances.
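
For illustration only, the sketch below computes a simple p-distance (proportion of substituted sites) between two aligned homologous sequences as a rough proxy for the evolutionary distance described above; more elaborate substitution models are typically used in practice, and the sequences shown are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of substituted sites between two aligned sequences; gap columns are skipped."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper()) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

# Hypothetical aligned fragments from two species: one substitution over eight aligned sites.
print(p_distance("ACGTTACA", "ACGTCACA"))  # 0.125
```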

As described above, the query sequence with a masked base at a target position is aligned with a plurality of non-query sequences from other species in the MSA. MSA aligns multiple homologous sequences with a target and is an important step in comparative analyses and property prediction of biological sequences, since much information, for example, evolution and coevolution clusters, is generated from the alignment and can be mapped to the target sequence of choice. MSA also provides information on evolutionary conservation by showing conserved regions across multiple species.

FIG. 2 illustrates an example of MSA 200 including aligned genomic sequences of a variety of species including human, chimpanzee, bonobo, gorilla, etc. A person skilled in the art will appreciate that MSA 200 can include more sequences. In one implementation, MSA 200 aligns at least hundreds of sequences. In another implementation, MSA 200 aligns thousands of sequences. It is also noted that the query sequence and non-query sequences in the MSA are not limited to genomic sequences. Various sequences of residues including amino acids, nucleotides, carbohydrates, lipids, or other types of biological polymers can be used. Each residue can be a position/element in the sequences, which are arranged as multiple sequence alignments.

In one implementation, an MSA of 50 primate reference genomes is augmented using sequence data from 556 individuals drawn from an additional 193 primate species. For each new species, an approximate reference genome is derived by combining its sequence data with the reference genome of the species in the initial MSA that is evolutionarily closest to it (i.e., the base species). Read pairs from each sample are first assembled with Megahit (version 1.2.9, default parameters), resulting in assemblies with an average contig N50 of 34 Kb. The assembly sizes range from 2.2 Gb to 3.7 Gb, thus falling within the typical range of reported genome sizes for primates. The resulting contigs are then aligned to the reference genome of the base species with minimap2 (version 2.22, parameters -ax asm10). Finally, variants are called with the paftools utility that accompanies minimap2, only considering alignments from contigs at least 1000 bp in length (parameter -L 1000). Assembling reads prior to alignment is better placed to detect more complex variation between the derived and initial species than resequencing by alignment of individual read pairs.
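
The following sketch strings these tools together for one sample, using the parameters stated above (Megahit defaults, the minimap2 asm10 preset, and a 1000 bp minimum alignment length for variant calling). File names are hypothetical, and the PAF-plus-cs-tag calling route follows the general minimap2/paftools documentation rather than anything specified in this disclosure, so the exact invocation may differ.

```python
import subprocess

def build_variants_for_derived_species(reads_1, reads_2, base_ref, out_dir):
    # De novo assembly of the read pairs with Megahit (default parameters).
    subprocess.run(["megahit", "-1", reads_1, "-2", reads_2, "-o", f"{out_dir}/asm"], check=True)
    contigs = f"{out_dir}/asm/final.contigs.fa"
    # Align contigs to the base species' reference with the asm10 preset; --cs tags are
    # assumed here because paftools uses them for variant calling (not stated in the text).
    with open(f"{out_dir}/contigs.paf", "w") as paf:
        subprocess.run(["minimap2", "-cx", "asm10", "--cs", base_ref, contigs],
                       stdout=paf, check=True)
    # Sort alignments by reference position, then call variants, keeping only
    # contig alignments of at least 1000 bp (-L 1000).
    sorted_paf = f"{out_dir}/contigs.srt.paf"
    subprocess.run(f"sort -k6,6 -k8,8n {out_dir}/contigs.paf > {sorted_paf}", shell=True, check=True)
    with open(f"{out_dir}/variants.vcf", "w") as vcf:
        subprocess.run(["paftools.js", "call", "-L", "1000", "-f", base_ref, sorted_paf],
                       stdout=vcf, check=True)
```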

For each derived species, the discovered variants are moved into the reference genome of its associated base species with bcftools (i.e., a set of tools that manipulate variant calls in the variant call format and its binary counterpart) to build an approximate consensus for the derived species. Given the aligned positions of the base genome in the initial MSA, the same sites are extracted from the derived genome (taking into consideration coordinate changes caused by insertions and deletions) to create a set of alignment lines for the derived species, which are then inserted into the MSA for the expanded species set.
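
A minimal sketch of this consensus step, under the assumption that the variants are available as a compressed, indexed VCF, might use the bcftools consensus subcommand; the file names are hypothetical and the disclosure does not specify the exact bcftools invocation.

```python
import subprocess

def approximate_consensus(base_ref_fa, variants_vcf_gz, out_fa):
    # Index the variant calls, then apply them to the base species' reference to obtain
    # an approximate consensus genome for the derived species.
    subprocess.run(["bcftools", "index", "-f", variants_vcf_gz], check=True)
    with open(out_fa, "w") as out:
        subprocess.run(["bcftools", "consensus", "-f", base_ref_fa, variants_vcf_gz],
                       stdout=out, check=True)
```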

This procedure can be validated by applying it to species for which a well-characterized reference sequence is already available, for example, aligning contigs assembled from a gorilla sample to the chimpanzee reference sequence to create an approximation of the gorilla genome. Error rates are computed based only on those fragments of the approximate and “true” references that participate in the MSA, since it is only these regions that are relevant to our subsequent analysis. An acceptable rate of error is considered to be one that approximates the known rate of heterozygosity between individuals of the species.

In one implementation, to discover primate-specific conserved elements, multiple individuals from 211 primate species are sequenced, which, combined with other earlier reference assemblies, brings the total to 236 out of the 521 extant primate species. For each species, PCR-free whole genome sequencing is performed with 150 bp paired end reads, obtaining an average of 57× whole genome coverage per species. Megahit (i.e., a next-generation sequencing assembler optimized for metagenomes) is used to perform a de novo assembly of the reads into contigs. Thus, the existing MSA of 50 primate reference assemblies is expanded to a 236-species MSA by aligning each short read contig to the closest available reference species. FIG. 3 is an example ideogram of the human genome depicting the average number of species covering each position in the MSA. As illustrated, the human genome is well covered by primate alignments except for telomeric, centromeric, and heterochromatic regions, which are highly repetitive in the human reference. Compared to existing models that use 30 primate species, the constructed 236-species MSA used as input to the second model 114 significantly expands the sequence data from a far greater number of primate species, and facilitates the identification of the masked base and the corresponding conservation at short evolutionary distances.

To quantify the conservation across the human genome, each base in the human genome is covered on average by 185 other primate species, and 99.2% of the euchromatic regions of the human genome are covered by at least 100 other primate species. FIG. 4A illustrates a genome-wide distribution of primate assembly coverage for each 100 kb segment of the human genome. Over 2.8 Gb of the human genome are covered by at least 100 primate species. The sequence data of a set of 25 species within the 236 species is analyzed to ensure the per base error rate in the de novo assemblies is sufficiently low for subsequent conservation analysis. The data representing both the short read contigs and reference genome assemblies is compared with long read assemblies that had been generated. The rates of mismatches between these assembly pairs range between 0.02% and 0.5% across these 25 species and are largely explained by differences in the species' heterozygosity. FIG. 4B illustrates the per base mismatch rate between newly generated short read contigs and species with previously existing reference assemblies. It is found that a majority of the mismatch rate (e.g., 82%) can be explained by allelic variation within the species. When the mismatch rate caused by heterozygosity is excluded, the residual per base error rates are below 0.1%. When the intraspecific variations are further excluded, the average remaining mismatch rates attributable to assembly and sequencing errors are reduced to 0.04%.

Thus, the constructed MSA including 236 primate species increases the phylogenetic branch length 2.8-fold over the previously available 30 primate species alignment from the Zoonomia study. This MSA can be used as, for example, the second input 112 to the second model 114, such that it can detect conservation at short evolutionary distances using a far greater number of primate species.

Next, we turn to the first and second models 104 and 114, and example architectures thereof.

Architectures of First Model and Second Model

The following discussion provides different examples of machine learning architectures that can be used to implement the first model and the second model. These example machine learning architectures can take as input machine-processable or vectorized representations of genomic data, for example, one-hot encodings of nucleotides and/or amino acids, process the machine-processable representations through a plurality of hidden layers and weights of the machine learning architectures, produce learned or alternative or intermediate or compressed representations of the machine-processable representations, and generate one or more outputs based on the learned or alternative or intermediate or compressed representations. These outputs can be genotype predictions identifying one or more attributes or identities of the genomic data, such as the identity of the nucleotides and/or amino acids, evolutionary conservation states of the nucleotides and/or amino acids, the pathogenicity of the nucleotides and/or amino acids, and so on.

In one implementation, the first model 104 and the second model 114 are multilayer perceptrons (MLPs). In another implementation, the first model 104 and the second model 114 are neural networks, for example, feedforward neural networks. In yet another implementation, the first model 104 and the second model 114 are fully-connected neural networks, for example, fully-connected convolutional neural networks. In a yet further implementation, the first model 104 and the second model 114 are semantic segmentation neural networks. In yet another further implementation, the first model 104 and the second model 114 are generative adversarial networks (GANs) (e.g., CycleGAN, StyleGAN, pixelRNN, text-2-image, DiscoGAN, IsGAN). In yet another further implementation, the first model 104 and the second model 114 are transformer-based networks, for example, Transformer-based CNNs.

In one implementation, the first model 104 and the second model 114 are convolution neural networks (CNNs) with a plurality of convolution layers. In another implementation, the first model 104 and the second model 114 are recurrent neural networks (RNNs) such as long short-term memory networks (LSTM), bi-directional LSTM (Bi-LSTM), or a gated recurrent unit (GRU). In yet another implementation, the first model 104 and the second model 114 include both a CNN and an RNN.

In yet other implementations, the first model 104 and the second model 114 can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1×1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions. The first model 104 and the second model 114 can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. The first model 104 and the second model 114 can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD). The first model 104 and the second model 114 can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (like an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., non-linear transformation functions like rectifying linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, and attention mechanisms (e.g., self-attention).

The first model 104 and the second model 114 can be a rule-based model, a linear regression model, a logistic regression model, an Elastic Net model, a support vector machine (SVM), a random forest (RF), a decision tree, a boosted decision tree (e.g., XGBoost), or some other tree-based logic (e.g., metric trees, kd-trees, R-trees, universal B-trees, X-trees, ball trees, locality sensitive hashes, and inverted indexes). The first model 104 and the second model 114 can be an ensemble of multiple models, in some implementations.

The first model 104 and the second model 114 can be trained using backpropagation-based gradient update techniques. Example gradient descent techniques that can be used for training the first model 104 and the second model 114 include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Some examples of gradient descent optimization algorithms that can be used to train the first model 104 and the second model 114 include Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad.
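
A minimal training-step sketch in this spirit is shown below, using a stand-in fully-connected network, cross-entropy loss, and the Adam optimizer; the architecture, input shape (the toy 5x5x7 one-hot MSA from earlier), and hyperparameters are illustrative assumptions, not the disclosed configuration.

```python
import torch
from torch import nn

num_classes = 6  # A, C, T, G, X, -
model = nn.Sequential(nn.Flatten(), nn.Linear(5 * 5 * 7, 128), nn.ReLU(), nn.Linear(128, num_classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(msa_batch, masked_base_labels):
    """msa_batch: (batch, 5, 5, 7) one-hot MSAs; masked_base_labels: (batch,) class indices."""
    optimizer.zero_grad()
    logits = model(msa_batch)
    loss = loss_fn(logits, masked_base_labels)
    loss.backward()    # backpropagate the prediction error
    optimizer.step()   # gradient update (Adam in this sketch)
    return loss.item()

batch = torch.rand(8, 5, 5, 7)
labels = torch.randint(0, num_classes, (8,))
print(train_step(batch, labels))
```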

Examples of generative models, including transformer-based models and encoder-decoder models, are described in the following. Generative models are often used in natural language processing. In recent years, generative models have emerged as powerful machine-learning tools to discover evolutionary and functional information of biological sequences (e.g., genomic sequences and protein sequences). Biological sequences can be transformed into vector representations using word embedding. The vector representations can be used efficaciously for genome identification.

Transformer-Based Models

In different implementations, the first model 104 and the second model 114 include self-attention mechanisms like Transformer, Vision Transformer (ViT), Bidirectional Transformer (BERT), Detection Transformer (DETR), Deformable DETR, UP-DETR, DeiT, Swin, GPT, iGPT, GPT-2, GPT-3, BERT, SpanBERT, RoBERTa, XLNet, ELECTRA, UniLM, BART, T5, ERNIE (THU), KnowBERT, DeiT-Ti, DeiT-S, DeiT-B, T2T-ViT-14, T2T-ViT-19, T2T-ViT-24, PVT-Small, PVT-Medium, PVT-Large, TNT-S, TNT-B, CPVT-S, CPVT-S-GAP, CPVT-B, Swin-T, Swin-S, Swin-B, Twins-SVT-S, Twins-SVT-B, Twins-SVT-L, Shuffle-T, Shuffle-S, Shuffle-B, XCiT-S12/16, CMT-S, CMT-B, VOLO-D1, VOLO-D2, VOLO-D3, VOLO-D4, MoCo v3, ACT, TSP, Max-DeepLab, VisTR, SETR, Hand-Transformer, HOT-Net, METRO, Image Transformer, Taming transformer, TransGAN, IPT, TTSR, STTN, Masked Transformer, CLIP, DALL-E, Cogview, UniT, ASH, TinyBert, FullyQT, ConvBert, FCOS, Faster R-CNN+FPN, DETR-DC5, TSP-FCOS, TSP-RCNN, ACT+MKDD (L=32), ACT+MKDD (L=16), SMCA, Efficient DETR, UP-DETR, UP-DETR, ViT-B/16-FRCNN, ViT-B/16-FRCNN, PVT-Small+RetinaNet, Swin-T+RetinaNet, Swin-T+ATSS, PVT-Small+DETR, TNT-S+DETR, YOLOS-Ti, YOLOS-S, and YOLOS-B.

Transformer Logic

Machine learning is the use and development of computer systems that can learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data. Some of the state-of-the-art models use Transformers, a more powerful and faster model than neural networks alone. Neural networks process input in series (e.g., time series data including sequencing-by-synthesis (SBS) sequencing data) and weight relationships by distance in the series. Transformers can process input in parallel and do not necessarily weight by distance. Transformers can be used in addition to neural networks. This architecture is described here.

Encoder-Decoder Architecture

FIG. 5 is a schematic representation of an encoder-decoder architecture. This architecture is often used for time-series data processing (e.g., sequencing data generated via sequencing-by-synthesis) and has two main building blocks. The first building block is the encoder that encodes an input into a fixed-size vector. In the system described herein, the encoder is based on a recurrent neural network (RNN). At each time step, t, a hidden state of time step, t−1, is combined with the input value at time step t to compute the hidden state at time step t. The hidden state at the last time step, encoded in a context vector, contains relationships encoded at all previous time steps.

The context vector is then passed to the second building block, thedecoder. For translation, the decoder has been trained on a secondlanguage. Conditioned on the input context vector, the decoder generatesan output sequence. At each time step, t, the decoder is fed the hiddenstate of time step, t−1, and the output generated at time step, t−1. Thefirst hidden state in the decoder is the context vector, generated bythe encoder. The context vector is used by the decoder to perform thetranslation.

The whole model is optimized end-to-end using backpropagation, a method of training a neural network in which the initial system output is compared to the desired output and the system is adjusted until the difference is minimized. In backpropagation, the encoder is trained to extract the right information from the input sequence, while the decoder is trained to capture the grammar and vocabulary of the output language. This results in a fluent model that uses context and generalizes well. When training an encoder-decoder model, the real output sequence is used to train the model to prevent mistakes from accumulating. When testing the model, the previously predicted output value is used to predict the next one.

When performing a translation task using the encoder-decoder architecture, all information about the input sequence is forced into one vector, the context vector. Information connecting the beginning of the sentence with the end is lost (the vanishing gradient problem). Also, different parts of the input sequence are important for different parts of the output sequence, information that cannot be learned using only RNNs in an encoder-decoder architecture.

Attention Mechanism

Attention mechanisms distinguish Transformers from other machine learning models. The attention mechanism provides a solution for the vanishing gradient problem. FIG. 6 shows an overview of an attention mechanism added onto an RNN encoder-decoder architecture. At every step, the decoder is given an attention score, e, for each encoder hidden state. In other words, the decoder is given weights for each relationship between words in a sentence. The decoder uses the attention score concatenated with the context vector during decoding. The output of the decoder at time step t is based on all encoder hidden states and the attention outputs. The attention output captures the relevant context for time step t from the original sentence. Thus, words at the end of a sentence may now have a strong relationship with words at the beginning of the sentence. In the sentence “The quick brown fox, upon arriving at the doghouse, jumped over the lazy dog,” fox and dog can be closely related despite being far apart in this complex sentence.

To weight encoder hidden states, a dot product between the decoderhidden state of the current time step, and all encoder hidden states, iscalculated. This results in an attention score for every encoder hiddenstate. The attention scores are higher for those encoder hidden statesthat are similar to the decoder hidden state of the current time step.Higher values for the dot product indicate the vectors are pointing moreclosely in the same direction. The attention scores are converted tofractions that sum to one using the SoftMax function.

The SoftMax scores provide an attention distribution. The x-axis of thedistribution is the position in a sentence. The y-axis is attentionweight. The scores show which encoder hidden states are most closelyrelated. The SoftMax scores specify which encoder hidden states are themost relevant for the decoder hidden state of the current time step.

The elements of the attention distribution are used as weights tocalculate a weighted sum over the different encoder hidden states. Theoutcome of the weighted sum is called the attention output. Theattention output is used to predict the output, often in combination(concatenation) with the decoder hidden states. Thus, both informationabout the inputs, as well as the already generated outputs, can be usedto predict the next outputs.
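As a hedged illustration of the score, SoftMax, and weighted-sum steps described above, the following short Python sketch uses NumPy with toy dimensions; the array sizes and random values are hypothetical and not tied to any particular model disclosed herein.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical toy sizes: 4 encoder hidden states, each of dimension 3.
encoder_states = np.random.randn(4, 3)   # one row per input position
decoder_state = np.random.randn(3)       # decoder hidden state at the current time step

scores = encoder_states @ decoder_state            # one attention score per encoder state
attention_distribution = softmax(scores)           # fractions that sum to one
attention_output = attention_distribution @ encoder_states  # weighted sum of encoder states
print(attention_distribution, attention_output)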

By making it possible to focus on specific parts of the input in everydecoder step, the attention mechanism solves the vanishing gradientproblem. By using attention, information flows more directly to thedecoder. It does not pass through many hidden states. Interpreting theattention step can give insights into the data. Attention can be thoughtof as a soft alignment. The words in the input sequence with a highattention score align with the current target word. Attention describeslong-range dependencies better than RNN alone. This enables the analysisof longer, more complex sentences.

The attention mechanism can be generalized as: given a set of vectorvalues and a vector query, attention is a technique to compute aweighted sum of the vector values, dependent on the vector query. Thevector values are the encoder hidden states, and the vector query is thedecoder hidden state at the current time step.

The weighted sum can be considered a selective summary of theinformation present in the vector values. The vector query determines onwhich of the vector values to focus. Thus, a fixed-size representationof the vector values can be created, in dependence upon the vectorquery.

The attention scores can be calculated by the dot product, or byweighting the different values (multiplicative attention).

Embeddings

For most machine learning models, the input to the model needs to benumerical. The input to a translation model is a sentence, and words arenot numerical. Multiple methods exist for the conversion of words intonumerical vectors. These numerical vectors are called the embeddings ofthe words. Embeddings can be used to convert any type of symbolicrepresentation into a numerical one.

Embeddings can be created by using one-hot encoding. The one-hot vector representing the symbols has the same length as the total number of possible different symbols. Each position in the one-hot vector corresponds to a specific symbol. For example, when converting colors to a numerical vector, the length of the one-hot vector would be the total number of different colors present in the dataset. For each input, the location corresponding to the color of that value is one, whereas all the other locations are valued at zero. This works well for images. For natural language processing (NLP), this becomes problematic, because the number of words in a language is very large. This results in enormous models and the need for a lot of computational power. Furthermore, no specific information is captured with one-hot encoding. From the numerical representation, it is not clear that orange and red are more similar than orange and green. For this reason, other methods exist.
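A minimal Python sketch of one-hot encoding, using the color example above, is shown below; the helper name and toy vocabulary are illustrative only.

import numpy as np

def one_hot(symbols, vocabulary):
    # Each one-hot vector has the same length as the number of possible symbols.
    index = {s: i for i, s in enumerate(vocabulary)}
    vectors = np.zeros((len(symbols), len(vocabulary)))
    for row, symbol in enumerate(symbols):
        vectors[row, index[symbol]] = 1.0
    return vectors

print(one_hot(["red", "green", "red"], ["red", "orange", "green"]))
# Note that the vectors for "red" and "orange" are no more similar than those for
# "red" and "green": one-hot encoding captures no notion of relatedness.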

A second way of creating embeddings is by creating feature vectors. Every symbol has its specific vector representation, based on features. With colors, a vector of three elements could be used, where the elements represent the amount of yellow, red, and/or blue needed to create the color. Thus, all colors can be represented by only using a vector of three elements. Also, similar colors have similar representation vectors.

Embeddings based on context can be trained. Words with similar meanings occur in similar contexts. At the nucleotide level, particular combinations of three DNA or RNA nucleotides correspond to specific amino acids or stop signals during protein synthesis. In addition, homologous proteins or genes have sequence similarity that reflects common ancestry. Different methods take the context into account. For natural language processing, some methods, like GloVe, base their context embedding on co-occurrence statistics from corpora (large texts) such as Wikipedia. Words with similar co-occurrence statistics have similar word embeddings. Other methods use neural networks to train the embeddings. For example, they train their embeddings to predict the word based on the context (Continuous Bag of Words), and/or to predict the context based on the word (Skip-Gram). Training these contextual embeddings is time intensive. For this reason, pre-trained libraries exist. Other deep learning methods can be used to create embeddings. For example, the latent space of a variational autoencoder (VAE) can be used as the embedding of the input. Another method is to use 1D convolutions to create embeddings. This causes a sparse, high-dimensional input space to be converted to a denser, low-dimensional feature space.

Self-Attention: Queries (Q), Keys (K), Values (V)

Transformer models are based on the principle of self-attention.Self-attention allows each element of the input sequence to look at allother elements in the input sequence and search for clues that can helpit to create a more meaningful encoding. It is a way to look at whichother sequence elements are relevant for the current element. TheTransformer can grab context from both before and after the currentlyprocessed element.

When performing self-attention, three vectors need to be created for each element of the encoder input: the query vector (Q), the key vector (K), and the value vector (V). These vectors are created by multiplying the input embedding vector by three unique weight matrices.

After this, self-attention scores are calculated. When calculating self-attention scores for a given element, the dot products between the query vector of this element and the key vectors of all other input elements are calculated. To make the model mathematically more stable, these self-attention scores are divided by the square root of the size of the vectors. This has the effect of reducing the importance of the magnitude of the vectors, thus emphasizing the importance of their direction. Just as before, these scores are normalized with a SoftMax layer. This attention distribution is then used to calculate a weighted sum of the value vectors, resulting in a vector z for every input element. In the attention principle explained above, the vector used to calculate attention scores and the vector used to perform the weighted sum were the same; in self-attention, two different vectors are created and used. Because self-attention needs to be calculated for all elements (thus a query for every element), one formula can be created to calculate a Z matrix. The rows of this Z matrix are the z vectors for every sequence input element, giving the matrix a size of sequence length by the dimension of the value vectors.
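The following Python sketch restates the self-attention computation just described with NumPy under assumed toy dimensions; the weight matrices are random stand-ins for trained parameters.

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X has shape (sequence_length, d_model); the three weight matrices project X
    # into queries Q, keys K, and values V.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise SoftMax
    return weights @ V                                         # Z: one z vector per input element

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                                    # 6 sequence elements, embedding size 8
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)                  # (6, 4)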

Multi-headed attention is executed in the Transformer. FIG. 7 is aschematic representation of the calculation of self-attention showingone attention head. For every attention head, different weight matricesare trained to calculate Q, K, and V. Every attention head outputs amatrix Z. Different attention heads can capture different types ofinformation. The different Z matrices of the different attention headsare concatenated. This matrix can become large when multiple attentionheads are used. To reduce dimensionality, an extra weight matrix W istrained to condense the different attention heads into a matrix with thesame size as one Z matrix. This way, the amount of data given to thenext step does not enlarge every time self-attention is performed.

When performing self-attention, information about the order of thedifferent elements within the sequence is lost. To address this problem,positional encodings are added to the embedding vectors. Every positionhas its unique positional encoding vector. These vectors follow aspecific pattern, which the Transformer model can learn to recognize.This way, the model can consider distances between the differentelements.
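One commonly used pattern for positional encodings is the sinusoidal scheme of the original Transformer; the Python sketch below is a hedged example of that scheme and is not the only pattern the models can learn to recognize.

import numpy as np

def sinusoidal_positional_encoding(sequence_length, d_model):
    # Alternating sine and cosine signals whose frequencies vary with the vector index.
    positions = np.arange(sequence_length)[:, None]
    dims = np.arange(d_model)[None, :]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((sequence_length, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

embeddings = np.random.randn(10, 16)
embeddings = embeddings + sinusoidal_positional_encoding(10, 16)  # added, not concatenated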

As discussed above, in the core of self-attention are three objects:queries (Q), keys (K), and values (V). Each of these objects has aninner semantic meaning of their purpose. One can think of these asanalogous to databases. We have a user-defined query of what the userwants to know. Then we have the relations in the database, i.e., thevalues which are the weights. More advanced database management systemscreate some apt representation of its relations to retrieve values moreefficiently from the relations. This can be achieved by using indexes,which represent information about what is stored in the database. In thecontext of attention, indexes can be thought of as keys. So instead ofrunning the query against values directly, the query is first executedon the indexes to retrieve where the relevant values or weights arestored. Lastly, these weights are run against the original values toretrieve data that are most relevant to the initial query.

FIG. 8 depicts several attention heads in a Transformer block. We cansee that the outputs of queries and keys dot products in differentattention heads are differently colored. This depicts the capability ofthe multi-head attention to focus on different aspects of the input andaggregate the obtained information by multiplying the input withdifferent attention weights.

Examples of attention calculation include scaled dot-product attention and additive attention. There are several reasons why scaled dot-product attention is used in Transformers. Firstly, the scaled dot-product attention is relatively fast to compute, since its main parts are matrix operations that can be run on modern hardware accelerators. Secondly, it performs similarly well as additive attention for smaller dimensions of the K matrix, dk. For larger dk, the scaled dot-product attention performs a bit worse because dot products can cause the vanishing gradient problem. This is compensated via the scaling factor, which is defined as 1/√{square root over (dk)}.

As discussed above, the attention function takes as input three objects: key, value, and query. In the context of Transformers, these objects are matrices of shapes (n, d), where n is the number of elements in the input sequence and d is the hidden representation of each element (also called the hidden vector). Attention is then computed as:

${{Attention}\left( {Q,K,V} \right)} = {{SoftMax}\left( \frac{QK^{T}}{\sqrt{d_{k}}} \right)V}$

where Q, K, V are computed as:

Q=X·W _(Q) , K=X·W _(K) , V=X·W _(V)

X is the input matrix and W_(Q), W_(K), W_(V) are learned weights that project the input matrix into the Q, K, and V representations. The dot products appearing in the attention function are exploited for their geometrical interpretation, where higher values of their results mean that the inputs are more similar, i.e., pointing in the geometrical space in the same direction. Since the attention function now works with matrices, the dot product becomes matrix multiplication. The SoftMax function is used to normalize the attention weights so that they sum to 1 prior to being multiplied by the values matrix. The resulting matrix is used either as input into another layer of attention or becomes the output of the Transformer.

Multi-Head Attention

Transformers become even more powerful when multi-head attention isused. Queries, keys, and values are computed the same way as above,though they are now projected into h different representations ofsmaller dimensions using a set of h learned weights. Each representationis passed into a different scaled dot-product attention block called ahead. The head then computes its output using the same procedure asdescribed above.

Formally, the multi-head attention is defined as

MultiHeadAttention(Q,K,V)=[head₁, . . . ,head_(h) ]W ₀

where head_(i)=Attention(QW _(i) ^(Q) ,KW _(i) ^(K) ,VW _(i) ^(V))

The outputs of all heads are concatenated together and projected againusing the learned weights matrix W₀ to match the dimensions expected bythe next block of heads or the output of the Transformer. Using themulti-head attention instead of the simpler scaled dot-product attentionenables Transformers to jointly attend to information from differentrepresentation subspaces at different positions.
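A hedged NumPy sketch of the multi-head attention formula above is given below; the number of heads, dimensions, and random weights are illustrative assumptions rather than disclosed parameters.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, heads, W_o):
    # heads is a list of (W_q, W_k, W_v) triples, one per attention head.
    outputs = [scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
               for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_o   # concatenate heads, project with W0

rng = np.random.default_rng(1)
n, d_model, h = 6, 16, 4
heads = [tuple(rng.normal(size=(d_model, d_model // h)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(d_model, d_model))
print(multi_head_attention(rng.normal(size=(n, d_model)), heads, W_o).shape)   # (6, 16)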

As shown in FIG. 9 , one can use multiple workers to compute themulti-head attention in parallel, as the respective heads compute theiroutputs independently of one another. Parallel processing is one of theadvantages of Transformers over RNNs.

Assuming the naive matrix multiplication algorithm, which has a complexity of:

a·b·c

for matrices of shapes (a, b) and (b, c), to obtain the values Q, K, V, we need to compute the operations:

Q=X·W _(Q) , K=X·W _(K) , V=X·W _(V)

The matrix X is of shape (n, d), where n is the number of patches and d is the hidden vector dimension. The weights W_(Q), W_(K), W_(V) are all of shape (d, d). Omitting the constant factor 3, the resulting complexity is:

n·d ²

We can proceed to the estimation of the complexity of the attention function itself, i.e., of

${{SoftMax}\left( \frac{QK^{T}}{\sqrt{d_{k}}} \right)V}$

The matrices Q and K are both of shape (n, d). The transposition operation does not influence the asymptotic complexity of computing the dot product of matrices of shapes (n, d)·(d, n), therefore its complexity is:

n ² ·d

Scaling by a constant factor of √{square root over (dk)}, where dk is the dimension of the keys vector, as well as applying the SoftMax function, both have the complexity of a·b for a matrix of shape (a, b), hence they do not influence the asymptotic complexity. Lastly, the dot product

${{SoftMax}\left( \frac{QK^{T}}{\sqrt{d_{k}}} \right) \cdot V}$

is between matrices of shapes (n, n) and (n, d), and so its complexity is:

n ² ·d

The final asymptotic complexity of scaled dot-product attention is obtained by summing the complexities of computing Q, K, and V, and of the attention function:

n·d ² +n ² ·d.

The asymptotic complexity of multi-head attention is the same since theoriginal input matrix X is projected into h matrices of shapes

$\left( {n,\frac{d}{h}} \right),$

where h is the number of heads. From the view of asymptotic complexity, h is constant; therefore, we would arrive at the same estimate of asymptotic complexity using a similar approach as for the scaled dot-product attention.

Transformer models often have the encoder-decoder architecture, althoughthis is not necessarily the case. The encoder is built out of differentencoder layers which are all constructed in the same way. The positionalencodings are added to the embedding vectors. Afterward, self-attentionis performed.

Encoder Block of Transformer

FIG. 10 portrays one encoder layer of a Transformer network. Everyself-attention layer is surrounded by a residual connection, summing upthe output and input of the self-attention. This sum is normalized, andthe normalized vectors are fed to a feed-forward layer. Every z vectoris fed separately to this feed-forward layer. The feed-forward layer iswrapped in a residual connection and the outcome is normalized too.Often, numerous encoder layers are piled to form the encoder. The outputof the encoder is a fixed-size vector for every element of the inputsequence.
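The encoder layer just described can be sketched in Keras as follows; this is a simplified, hypothetical configuration (layer sizes and head counts are assumptions) rather than the exact layer of FIG. 10.

import tensorflow as tf
from tensorflow.keras import layers

def encoder_layer(d_model=64, num_heads=4, d_ff=128):
    # Self-attention and feed-forward sublayers, each wrapped in a residual
    # connection that is then layer-normalized.
    inputs = tf.keras.Input(shape=(None, d_model))
    attention = layers.MultiHeadAttention(num_heads=num_heads,
                                          key_dim=d_model // num_heads)(inputs, inputs)
    x = layers.LayerNormalization()(layers.Add()([inputs, attention]))       # residual + norm
    feed_forward = layers.Dense(d_ff, activation="relu")(x)
    feed_forward = layers.Dense(d_model)(feed_forward)
    outputs = layers.LayerNormalization()(layers.Add()([x, feed_forward]))   # residual + norm
    return tf.keras.Model(inputs, outputs)

encoder_layer().summary()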

Just like the encoder, the decoder is built from different decoderlayers. In the decoder, a modified version of self-attention takesplace. The query vector is only compared to the keys of previous outputsequence elements. The elements further in the sequence are not knownyet, as they still must be predicted. No information about these outputelements may be used.

Encoder-Decoder Blocks of Transformer

FIG. 11 shows a schematic overview of a Transformer model. Next to aself-attention layer, a layer of encoder-decoder attention is present inthe decoder, in which the decoder can examine the last Z vectors of theencoder, providing fluent information transmission. The ultimate decoderlayer is a feed-forward layer. All layers are packed in a residualconnection. This allows the decoder to examine all previously predictedoutputs and all encoded input vectors to predict the next output. Thus,information from the encoder is provided to the decoder, which couldimprove the predictive capacity. The output vectors of the last decoderlayer need to be processed to form the output of the entire system. Thisis done by a combination of a feed-forward layer and a SoftMax function.The output corresponding to the highest probability is the predictedoutput value for a subject time step.

For some tasks other than translation, only an encoder is needed. This is true for both document classification and named entity recognition. In these cases, the encoded input vectors are the input of the feed-forward layer and the SoftMax layer. Transformer models have been extensively applied in time-series data processing fields. These models have applications in the field of biology as well, for predicting protein structure and function and labeling DNA sequences.

Vision Transformer

There are extensive applications of transformers in vision includingpopular recognition tasks (e.g., image classification, object detection,action recognition, and segmentation), generative modeling, multi-modaltasks (e.g., visual-question answering, visual reasoning, and visualgrounding), video processing (e.g., activity recognition, videoforecasting), low-level vision (e.g., image super-resolution, imageenhancement, and colorization) and 3D analysis (e.g., point cloudclassification and segmentation).

In image classification, we often have a single input image in which the pixels are in a sequence. To reduce the computation required, Vision Transformers (ViTs) cut the input image into a set of fixed-sized patches of pixels. The patches are often 16×16 pixels. ViTs are depicted in FIGS. 12A, 12B, 13A, 13B, 13C, and 13D. Unfortunately, important positional information is lost because sets of patches are position-invariant. This problem is solved by adding a learned positional encoding into the image patches.

The computations of the ViT architecture can be summarized as follows.The first layer of a ViT extracts a fixed number of patches from aninput image (FIG. 12A). The patches are then projected to linearembeddings. A special class token vector is added to the sequence ofembedding vectors to include all representative information of alltokens through the multi-layer encoding procedure. The class vector isunique to each image. Vectors containing positional information arecombined with the embeddings and the class token. The sequence ofembedding vectors is passed into the Transformer blocks. The class tokenvector is extracted from the output of the last Transformer block and ispassed into a multilayer perceptron (MLP) head whose output is the finalclassification. The perceptron takes the normalized input and places theoutput in categories. It classifies the images. This procedure directlytranslates into the Python Keras code shown in FIG. 14 .

When the input image is split into patches, a fixed patch size isspecified before instantiating a ViT. Given the quadratic complexity ofattention, patch size has a large effect on the length of training andinference time. A single Transformer block comprises several layers. Thefirst layer implements Layer Normalization, followed by the multi-headattention that is responsible for the performance of ViTs. In thedepiction of a Transformer block in FIG. 9B, we can see two arrows.These are residual skip connections. Including skip connection data cansimplify the output and improve the results. The output of themulti-head attention is followed again by Layer Normalization. Andfinally, the output layer is an MLP (Multi-Layer Perceptron) with theGELU (Gaussian Error Linear Unit) activation function.
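A hedged Keras sketch of the Transformer block described above (Layer Normalization, multi-head attention, residual skip connections, and a GELU MLP) is shown below; the patch count and dimensions are assumed, and this is not a reproduction of the code in FIG. 14.

import tensorflow as tf
from tensorflow.keras import layers

def vit_block(x, num_heads=4, key_dim=16, mlp_dim=128):
    # First sublayer: Layer Normalization followed by multi-head attention.
    h = layers.LayerNormalization()(x)
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(h, h)
    x = layers.Add()([x, h])                       # first residual skip connection
    # Second sublayer: Layer Normalization followed by a GELU MLP.
    h = layers.LayerNormalization()(x)
    h = layers.Dense(mlp_dim, activation="gelu")(h)
    h = layers.Dense(x.shape[-1])(h)
    return layers.Add()([x, h])                    # second residual skip connection

patches = tf.keras.Input(shape=(196, 64))          # e.g., 14x14 patches with 64-dim embeddings
model = tf.keras.Model(patches, vit_block(patches))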

ViTs can be pretrained and fine-tuned. Pretraining is generally done ona large dataset. Fine-tuning is done on a domain specific dataset.

Domain-specific architectures, like convolutional neural networks (CNNs) or long short-term memory networks (LSTMs), have been derived from the usual architecture of MLPs and may suffer from so-called inductive biases that predispose the networks towards a certain output. ViTs stepped in the opposite direction from CNNs and LSTMs and became more general architectures by eliminating inductive biases. A ViT can be seen as a generalization of MLPs because MLPs, after being trained, do not change their weights for different inputs. On the other hand, ViTs compute their attention weights at runtime based on the particular input.

Transformer Models as Applied to Genomics

The following discussion describes some implementations of how theTransformer models process a genomic sequence and produce position-wisenucleotide classification of the genomic sequence.

The Transformer models include convolutional layers that can detectlocal patterns, and thereby enhance the detection of nucleotide motifs.The Transformer models process a genome sequence in consecutive segmentsof length l. Every input nucleotide x∈{A, C, G, T} is first transformedinto a vector embedding h(0), after which it is transformed k timesthrough addition (residual connection) with another vector, obtained bythe multi-head attention function present in each layer (h(0)→ . . .→h(k)).

A set of fully connected layers transforms h(k) into a model outputŷ(k). For each residual block, the vector that is summed with the input(to obtain h(1), . . . , h(k)) is calculated using the hidden states ofl upstream positions.

The multi-head attention applied in each residual block ismethodologically identical. From each input hidden state h, a query (q),key (k), and value (v) vector of equal shapes are calculated. The outputz of the attention head, applied on the hidden state at position n, iscalculated as follows:

${z^{(n)} = {{{softmax}\left( \frac{q^{(n)} \cdot K^{T}}{\sqrt{d_{head}}} \right)} \cdot V}},$

where K, V∈ℝ^(l×d_(head)) are the matrices that are composed from the l upstream hidden states (e.g., K=[k^((n−l)), . . . , k^((n))]).

The denominator is used to stabilize the scores based on the dimensionsof q, k, and v (d_(head)). The multiplication of the query vector withall the key vectors results in a vector of scores that is normalized forall input values using the softmax function. These scores are multipliedto the v vectors for the calculation of z (i.e., a linear combination).The attention scores denote the relevance of information present betweentwo positions, where the multiplication of the q and k vectors functionas a lock and key encoding, which returns goodness-of-fit scores for theinformation embedded in two hidden states (defined by v).

In each residual block, multiple attention heads are present (hence,multi-head attention), each featuring their own unique sets of modelweights to calculate q, k, and v. As such, multiple types of informationcan be extracted from the input hidden states. The outcome of differentattention heads within the same layer is further processed into a singlevector, which is summed with h to obtain the hidden state of the nextlayer (e.g., h(1)→h(2)).

Contextual information embedded within the hidden states derived fromsingle nucleotides is limited. Motifs formed from multiple neighboringnucleotides are deemed of greater importance towards biologicalprocesses. The addition of a convolutional layer allows the q, k, and vvectors to be derived from multiple neighboring hidden states withoutaffecting the input/output resolution. Thereby, the retrieval ofrelevant information using attention is improved, resulting in improvedpredictive performances on a variety of tasks.

Positional information is used within the vectors q, k, and v bysuperimposing (i.e., through summation) a positional encoding vector toh. The added signal is a function of the vector index and the relativepositioning with respect to the other input hidden states.

The annotation of DNA is a sequence labeling task that has correspondences in natural language processing. The DNA sequence is a data set of n nucleotides, i.e., X∈{x^((1)), x^((2)), . . . , x^((n))}, where x∈{A, C, T, G}. The task comprises predicting a label y∈{0, 1} for each position x, where a positive label denotes the occurrence of an event at that position.

The Transformer models process the genome in sequential segments of lnucleotides. During training, a non-linear transformation function E isoptimized that maps the input classes {A, C, T, G} to a vector embeddingh of length d_(model). For nucleotide x(i) on the genome:

h=E(x ^((i))), x ^((i)) ∈{A,T,C,G},

where h∈ℝ^(d_(model)).

The hidden states of each segment, H∈ℝ^(l×d_(model))=[h⁽¹⁾, . . . , h^((l))], are processed through k layers. As such, the data propagation through the network for any input x follows multiple transformations: x→h^((0,:))→ . . . →h^((k,:))→ŷ.

Within each layer, multi-head attention is calculated for each hiddenstate. Next, for each hidden state of h, the output of the multi-headattention step (MultiHead) is summed with the input, i.e., a residualconnection, with the final step being layer normalization. Thecalculations of the output for all hidden states h in layer t atposition m of segment s are performed in parallel:

h ^((s,t+1,m))=LayerNorm(h ^((s,t,m))+MultiHead(H ^((s,t)))),

or

H ^((s,t+1))=LayerNorm(H ^((s,t))+MultiHead(H ^((s,t)))),

where t∈[0, k[ and m∈[1, l].

After a forward pass through k layers, a final linear combination reduces the dimension of the output hidden state (d_(model)) to the number of output classes. In one implementation, only binary classification is performed. In another implementation, a softmax layer is applied before obtaining the prediction value ŷ_(i) for nucleotide x_(i).
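For illustration, the per-layer computation above (multi-head attention, residual connection, layer normalization, and a final per-position classification) can be sketched in Keras as follows; the segment length, model width, and layer count are hypothetical choices, not disclosed parameters.

import tensorflow as tf
from tensorflow.keras import layers

def nucleotide_labeling_model(segment_length=512, d_model=32, k_layers=4, num_heads=4):
    # Integer-encoded nucleotides {A, C, G, T} -> embeddings -> k attention layers with
    # residual connections and layer normalization -> one binary label per position.
    tokens = tf.keras.Input(shape=(segment_length,), dtype="int32")
    h = layers.Embedding(input_dim=4, output_dim=d_model)(tokens)
    for _ in range(k_layers):
        attn = layers.MultiHeadAttention(num_heads=num_heads,
                                         key_dim=d_model // num_heads)(h, h)
        h = layers.LayerNormalization()(layers.Add()([h, attn]))
    outputs = layers.Dense(1, activation="sigmoid")(h)   # binary label per nucleotide
    return tf.keras.Model(tokens, outputs)

nucleotide_labeling_model().summary()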

Particular Implementations of Deep Learning Networks for EvolutionaryConservation

FIG. 15 illustrates an example system 1500 including the first model 1520 that predicts an identity of a masked base on the query sequence aligned with non-query sequences. The first input 1510 is an MSA of five 5 nt sequences. The query sequence is from human, with a masked base flanked by two right and two left bases. Four non-query sequences are from the four species closest to human, namely chimpanzee, gorilla, gray slender loris, and Sumatran orangutan.

In other implementations, the four species that are closest to human canbe Chimpanzee, bonobo, the western lowland gorilla, and the easterngorilla in terms of evolutionary distance. The assumption is that sincethese species are close to human in the phylogeny, with a mean branchlength of only 0.008, the predicted probability from this deep learningmodel should account for the neutral substitution rate in the genome.

The first model 1520 in this implementation is a neutral substitution rate model which employs only information from the closest species to specify the ancestral allele. The first model 1520 can be a convolutional neural network including a plurality of 1D convolution layers 1522 and 1524 and one or more dense layers 1526. One-hot-encoded sequences in the form of vectors are convolved with the kernels of the 1D convolution layers via multiplication and summation operations. The dense layer 1526 is densely connected to its preceding layer and applies activation functions to the output from the preceding layer.

FIG. 16 illustrates an example architecture of the first model with layer dimensions. As one of the most useful and efficient techniques for analyzing sequence data in deep learning architectures, 1D convolutional layers are used here as core building blocks of the example architecture. The R^((5×5)) dimensional input MSA (R²⁵ when flattened) is first passed through a masking layer that masks the center nucleotide in the human sequence. The MSA with the masked nucleotide is then passed through an embedding layer that embeds the input MSA into a 50-dimensional embedding space. This embedded MSA is next passed through 32 1D convolutional units with a kernel size of 5 and a stride of 5. This 1D convolutional layer is designed to process sequence information from each species. The next layer of 1D convolutional units also contains 32 units and has a kernel size of 5 with a stride of 1. This layer is designed to process sequence information across the species. Hence, by stacking the convolutional layers, we are designing the DL model to process sequence information both along the rows of the MSA and the columns of the MSA. The output of the last convolutional layer is then passed through a dense layer. The final layer is a six-dimensional SoftMax layer. Except for the final layer, all the intermediate layers use ReLU activation.
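The architecture just described might be sketched in Keras roughly as follows; the vocabulary size, dense-layer width, and the treatment of the masked center base are assumptions made for illustration and do not fix the disclosed implementation.

import tensorflow as tf
from tensorflow.keras import layers

def neutral_rate_model(vocab_size=7, embed_dim=50):
    # Flattened 5x5 MSA (25 tokens); the center human base is assumed to already be
    # replaced with a mask token during preprocessing (a hypothetical choice).
    msa = tf.keras.Input(shape=(25,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(msa)                        # 50-dim embedding space
    x = layers.Conv1D(32, kernel_size=5, strides=5, activation="relu")(x)   # within each species
    x = layers.Conv1D(32, kernel_size=5, strides=1, activation="relu")(x)   # across species
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)                              # dense-layer width assumed
    return tf.keras.Model(msa, layers.Dense(6, activation="softmax")(x))    # A, C, G, T, -, X

neutral_rate_model().summary()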

The first output 1530 is a predicted identity of the masked base in the query sequence (labeled by “?”), which can be a probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). For example, the aforementioned six-dimensional SoftMax layer can output the predicted probabilities of the masked nucleotide in the neutral substitution rate (NR) model, Y_(NR)={y^(A),y^(C),y^(G),y^(T),y⁻,y^(X)}. Alternatively, a different SoftMax layer can output the predicted probabilities of the masked nucleotide, Y_(NR)={y^(A),y^(C),y^(G),y^(T)}. FIG. 15 illustrates two examples of the predicted identity as a probability distribution of the masked base. The output 1540 illustrates that the predicted identity of the masked base is represented by a normalized probability distribution that specifies base-wise likelihoods of the base being A, C, T, and G. The masked base is predicted to be base C with the highest likelihood (p≈0.9). The output 1550 illustrates that the predicted identity is a log probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). Consistent with the output 1540, the masked base is predicted to be base C with the highest likelihood (log p>10⁻¹).

FIG. 17 illustrates an example system 1700 including the second model1720 that predicts an identity of a masked base on the query sequencealigned with non-query sequences. Unlike the first input 1510 includingsequences from 5 closest species, the second input 1710 is a fullalignment of 236 primate species, including all of the non-querysequences in the first input 1510. The query sequence in the secondinput 1710 can be identical to the query sequence in the first input1510.

The second model 1720 in this implementation is a conservation-adjustedsubstitution rate model which utilizes information from the full MSA of236 primate species. The second model 1720 can be a convolutional neuralnetwork, including a plurality of 1D convolution layers 1722 and 1724 asbuilding blocks of the architecture, and one or more dense layers 1726.The number of 1D convolution layers (1722 and 1724) and dense layers1726 can be different from (e.g., larger than) the number of layers inthe first model.

Similar to the two types of output from the first model, here, the output 1740 illustrates that the predicted identity of the masked base is represented by a normalized probability distribution that specifies base-wise likelihoods of the base being A, C, T, and G. The masked base is predicted to be base C with the highest likelihood (p≈1.0). The output 1750 illustrates that the predicted identity is a log probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-). The masked base is predicted to be base C with the highest likelihood (log p≈10⁰).

FIG. 18 illustrates an example architecture of the second model withlayer dimensions. In one implementation, to harness information from all230 species, the second model uses a larger number of convolutionallayers and dense layers as compared to the first model. The input forthe second model has a dimension of R^((230×5)) (R¹¹⁵⁰ when flattened).Like the first model, the input passes through the masking and embeddinglayers. To reduce overfitting, several dropout layers are usedthroughout the architecture of the conservation-adjusted substitutionrate model. The final SoftMax layer outputs the predicted probability ofthe masked nucleotide in the conservation-adjusted substitution rate(CR) model, Y_(CR)={y^(A),y^(C),y^(G),y^(T),y⁻,y^(X)}. The underlyingassumption is that, since the model is predicting the masked nucleotidegiven the sequence context of all 230 species, the output probabilitiesshould account for the evolutionary conservation adjusted substitutionrate.

If a nucleotide is highly conserved in the MSA, the second model (e.g.,conservation-adjusted substitution rate model) predicts that it is lesslikely for this masked human nucleotide to change from its ancestralstate due to evolutionary constraint. In the absence of conservation,the two models will converge on a higher substitution probabilityconsistent with the rate of change in neutral sequence.

The evolutionary conservation determination logic can be used todetermine how significantly the two models diverge in their predictionsfor each position in the genome based on the predicted identities (e.g.,represented by probability functions) of the masked base. Based on thedivergence, the evolutionary conservation determination logic cancalculate a per base conservation score to represent the respectiveevolutionary conservation.

The divergence can be determined using t-statistics, which measure the ratio of the departure of an estimated value of a parameter from its hypothesized value to its standard error. Alternatively, the divergence can be determined using the KL divergence, D_(KL)(P∥Q), which measures how a probability distribution Q is different from a reference probability distribution P.
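A minimal Python sketch of the KL divergence computation between two base-wise probability distributions follows; the example probabilities and variable names are hypothetical and are used only to illustrate the calculation.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q): how distribution Q differs from the reference distribution P.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical base-wise predictions (A, C, G, T) from the two models.
first_identity = [0.05, 0.80, 0.05, 0.10]    # e.g., neutral substitution rate model
second_identity = [0.01, 0.97, 0.01, 0.01]   # e.g., conservation-adjusted model
print(kl_divergence(second_identity, first_identity))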

Training of the First Model and the Second Model

Both the first and second models can be trained on ˜2.5 million randomlysampled 5 nt wide MSAs from the entire human genome. In oneimplementation, we trained both models using a query sequence fromhuman. That is, the first model is trained by using the query sequencefrom human and non-query sequences from four of the closest species tohuman in the MSA. The second model is trained by using the querysequence from human and non-query sequences from all 236 primatespecies. The center nucleotide on the query sequence is masked in bothof the models.

To train these models, we can minimize the categorical cross entropy loss between the true nucleotide and the predicted masked nucleotide using the Adam optimizer. The models are trained withholding 20% of the data as a test set. In one implementation, the models are trained for 30 epochs with a batch size of 1000 samples. FIG. 19 illustrates training and testing curves for the first model (e.g., neutral substitution rate model) and the second model (e.g., conservation-adjusted substitution rate model). The x axis shows the number of epochs, and the y axis shows the categorical cross entropy loss of the deep learning networks.
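As a hedged illustration of this training setup, the Keras sketch below compiles a stand-in model with the Adam optimizer and categorical cross entropy, withholding 20% of the data for evaluation; the stand-in architecture and array names are placeholders, not the disclosed models.

import tensorflow as tf
from tensorflow.keras import layers

# Minimal stand-in for either model; the real architectures are described above.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(25,), dtype="int32"),
    layers.Embedding(7, 50),
    layers.Flatten(),
    layers.Dense(6, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="categorical_crossentropy",     # loss between true and predicted base
              metrics=["accuracy"])

# Hypothetical arrays: integer-encoded MSA windows and one-hot encoded true center bases.
# history = model.fit(msa_windows, true_bases,
#                     validation_split=0.2,        # withhold 20% of the data as a test set
#                     epochs=30, batch_size=1000)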

To calculate prediction probabilities for all bases genome-wide, the trained model weights with the lowest test loss can be chosen (e.g., weights at epoch 14 for the conservation-adjusted substitution rate model and weights at epoch 18 for the neutral substitution rate model). Since the conservation-adjusted substitution rate model leverages sequence context from a larger number of species, it achieves 11% lower test loss as compared to the neutral substitution rate model. The accuracy of predicting the masked nucleotide in both models is shown in FIG. 20. Even though the model only uses 5 nt wide MSAs, it achieves test accuracy >99%. This suggests that for the task of masked nucleotide prediction from an MSA, the 5 nt window length can be sufficient.

In some implementations, deep learning architectures using transformers instead of 1D convolutional layers are tested. When using a small-length MSA, the transformer model achieves similar performance to that of the model with 1D convolutional layers. For example, with 1D convolutional layers the conservation-adjusted substitution rate model achieved a lowest test loss of 0.0413, and with transformers the model achieved a lowest test loss of 0.0414. In other implementations, transformers, Long Short-Term Memory (LSTM) networks, or any other recurrent deep learning neural networks as described above can be used as the first and/or second model to predict the evolutionary conservation at the resolution of single nucleotides in the MSA.

Objective Indicia of Inventiveness and Non-Obviousness

Evaluation of the Deep Learning Networks for Evolutionary Conservation

The performance of the technology disclosed is evaluated by comparing it with the PhyloP method, a test used for detecting nonneutral substitution rates on mammalian phylogenies. For PhyloP scores, the base-wise conservation scores are generated from an MSA containing 230 primate species using the phyloFit and phyloP functions from the PHAST package. Both the PhyloP method and the technology disclosed are able to detect conservation in protein-coding exons over surrounding intronic sequence. FIG. 21 illustrates the mean conservation scores plotted for the deep learning system (top) described herein and the PhyloP method (bottom), at protein-coding exons with a constant 130 nucleotide (nt) size and the same codon reading frame (red), as well as 30 nt of flanking intronic sequence on either side (blue).

However, the deep learning conservation scores generated from variousimplementations of the technology disclosed are capable of finer graineddistinctions, compared to the step functions seen in the distribution ofPhyloP scores due to the discontinuities in phylogenetic branch length.FIGS. 22A and 22B illustrate the distribution of conservation scores inprotein-coding exons at codon position 2 (blue) compared to codonposition 3 (orange), measured by deep learning conservation scores (FIG.22A) and PhyloP scores (FIG. 22B), respectively. The x axis of FIGS. 22Aand 22B is a score distribution and the y axis is the score density.Consequently, when comparing the nucleotides that are most highly scoredby each method (>95^(th) percentile), the top conserved nucleotidesidentified by the deep learning system are highly enriched at codonposition 2 (where mutations always alter the protein) relative to codonposition 3 (where mutations are often synonymous), compared to weakerenrichment for PhyloP.

FIG. 23 is a bar chart showing the enrichment of highly conservednucleotides (>95^(th) percentile of scores) falling in codon position 2relative to codon position 3. The x axis is the percentile while the yaxis is an enrichment ratio between codon position 2 and codon position3 for the most highly conserved nucleotides identified by the deeplearning system (blue) and PhyloP method (orange). The highly conservednucleotides identified by the deep learning system are highly enrichedat codon position 2 compared to codon position 3.

The PhyloP method relies on a prior of the evolutionary distance between the species in the MSA. Hence, PhyloP is sensitive to the distribution of branch lengths in the evolutionary tree. While it works well for measuring evolutionary conservation in a diverse group of species (e.g., mammals) with large evolutionary distances between species, it is underpowered when measuring conservation in a closely related group of species (e.g., primates) with short evolutionary distances between each other. As a result, when PhyloP was used to calculate per-base conservation genome-wide, no individual bases were found to be significantly conserved over chance (P<0.01), due to insufficient branch length within the primate species.

On the other hand, the deep learning system as described here can identify sequence elements that are exclusively conserved in primates and detect conservation at short evolutionary distances. The deep learning system's improved sensitivity can be attributed to more accurate modeling of substitution probabilities at nucleotides with high intrinsic mutation rates, which provides sufficient information to identify conserved bases even within the short evolutionary timescales of the primate lineage.

The deep learning system can identify protein-coding exons and genes that are conserved only in primates. Genome-wide, only 0.5% of human protein-coding exons (1001 exons) are significantly conserved in primates, but not in mammals. The deep learning system is essential for improving sensitivity to detect primate-conserved exons, enabling the identification of 8,973 additional exons that are deemed not significant using the PhyloP method. Most of these newly identified exons (85%) are conserved in mammals but are missed by the primate PhyloP method due to short branch lengths in primates.

Furthermore, the deep learning system can identify primate-conserved noncoding cis-regulatory elements. Noncoding sequence, which comprises nearly 99% of the human genome, is highly active in gene regulation, and most genetic variants associated with complex human diseases map to noncoding rather than protein-coding sequence, particularly to cis-regulatory elements such as active enhancers and DNase-I hypersensitivity sites. The deep learning system can identify primate-conserved DNase-I hypersensitivity sites at short branch lengths, adding 57,496 DNase-I hypersensitivity sites that are missed by the primate PhyloP method. These identifications also indicate that a significant fraction of cis-regulatory elements that were previously believed to lack conservation represent recent evolutionary adaptations, with sequence conservation detectable in primates.

Computer System

FIG. 24 is an example computer system 2400 that can be used to implementthe technology disclosed. Computer system 2400 includes at least onecentral processing unit (CPU) 2472 that communicates with a number ofperipheral devices via bus subsystem 2455. These peripheral devices caninclude a storage subsystem 2410 including, for example, memory devicesand a file storage subsystem 2436, user interface input devices 2438,user interface output devices 2476, and a network interface subsystem2474. The input and output devices allow user interaction with computersystem 2400. Network interface subsystem 2474 provides an interface tooutside networks, including an interface to corresponding interfacedevices in other computer systems.

In one implementation, the first model 104, the second model 114 and theevolutionary conservation determination logic 108 are communicablylinked to the storage subsystem 2410 and the user interface inputdevices 2438.

User interface input devices 2438 can include a keyboard; pointingdevices such as a mouse, trackball, touchpad, or graphics tablet; ascanner; a touch screen incorporated into the display; audio inputdevices such as voice recognition systems and microphones; and othertypes of input devices. In general, use of the term “input device” isintended to include all possible types of devices and ways to inputinformation into computer system 2400.

User interface output devices 2476 can include a display subsystem, aprinter, a fax machine, or non-visual displays such as audio outputdevices. The display subsystem can include an LED display, a cathode raytube (CRT), a flat-panel device such as a liquid crystal display (LCD),a projection device, or some other mechanism for creating a visibleimage. The display subsystem can also provide a non-visual display suchas audio output devices. In general, use of the term “output device” isintended to include all possible types of devices and ways to outputinformation from computer system 2400 to the user or to another machineor computer system.

Storage subsystem 2410 stores programming and data constructs thatprovide the functionality of some or all of the modules and methodsdescribed herein. These software modules are generally executed byprocessors 2478.

Processors 2478 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 2478 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of processors 2478 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX36 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon Processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.

Memory subsystem 2422 used in the storage subsystem 2410 can include anumber of memories including a main random access memory (RAM) 2432 forstorage of instructions and data during program execution and a readonly memory (ROM) 2434 in which fixed instructions are stored. A filestorage subsystem 2436 can provide persistent storage for program anddata files, and can include a hard disk drive, a floppy disk drive alongwith associated removable media, a CD-ROM drive, an optical drive, orremovable media cartridges. The modules implementing the functionalityof certain implementations can be stored by file storage subsystem 2436in the storage subsystem 2410, or in other machines accessible by theprocessor.

Bus subsystem 2455 provides a mechanism for letting the variouscomponents and subsystems of computer system 2400 communicate with eachother as intended. Although bus subsystem 2455 is shown schematically asa single bus, alternative implementations of the bus subsystem can usemultiple busses.

Computer system 2400 itself can be of varying types, including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 2400 depicted in FIG. 24 is intended only as a specific example for purposes of illustrating the preferred implementations of the technology disclosed. Many other configurations of computer system 2400 are possible, having more or fewer components than the computer system depicted in FIG. 24.

CLAUSES

The technology disclosed, in particularly, the clauses disclosed in thissection, can be practiced as a system, method, or article ofmanufacture. One or more features of an implementation can be combinedwith the base implementation. Implementations that are not mutuallyexclusive are taught to be combinable. One or more features of animplementation can be combined with other implementations. Thisdisclosure periodically reminds the user of these options. Omission fromsome implementations of recitations that repeat these options should notbe taken as limiting the combinations taught in the precedingsections—these recitations are hereby incorporated forward by referenceinto each of the following implementations.

One or more implementations and clauses of the technology disclosed, orelements thereof can be implemented in the form of a computer product,including a non-transitory computer readable storage medium withcomputer usable program code for performing the method steps indicated.Furthermore, one or more implementations and clauses of the technologydisclosed, or elements thereof can be implemented in the form of anapparatus including a memory and at least one processor that is coupledto the memory and operative to perform exemplary method steps. Yetfurther, in another aspect, one or more implementations and clauses ofthe technology disclosed or elements thereof can be implemented in theform of means for carrying out one or more of the method steps describedherein; the means can include (i) hardware module(s), (ii) softwaremodule(s) executing on one or more hardware processors, or (iii) acombination of hardware and software modules; any of (i)-(iii) implementthe specific techniques set forth herein, and the software modules arestored in a computer readable storage medium (or multiple such media).

The clauses described in this section can be combined as features. Inthe interest of conciseness, the combinations of features are notindividually enumerated and are not repeated with each base set offeatures. The reader will understand how features identified in theclauses described in this section can readily be combined with sets ofbase features identified as implementations in other sections of thisapplication. These clauses are not meant to be mutually exclusive,exhaustive, or restrictive; and the technology disclosed is not limitedto these clauses but rather encompasses all possible combinations,modifications, and variations within the scope of the claimed technologyand its equivalents.

Other implementations of the clauses described in this section caninclude a non-transitory computer readable storage medium storinginstructions executable by a processor to perform any of the clausesdescribed in this section. Yet another implementation of the clausesdescribed in this section can include a system including memory and oneor more processors operable to execute instructions, stored in thememory, to perform any of the clauses described in this section.

We disclose the following clauses:

1. A system, comprising:

a first model configured to

-   -   process a first multiple sequence alignment that aligns a query        sequence to N non-query sequences, wherein the query sequence        includes a masked base at a target position that is flanked by        right and left bases, and    -   predict a first identity of the masked base at the target        position; a second model configured to    -   process a second multiple sequence alignment that aligns the        query sequence to M non-query sequences, where M>N,    -   predict a second identity of the masked base at the target        position; and

an evolutionary conservation determination logic configured to measurean evolutionary conservation of the masked base at the target positionbased on the first and second identities of the masked base.

2. The system of clause 1, wherein the N non-query sequences are N closest homologues to the query sequence.
3. The system of clause 1, wherein the M non-query sequences are M closest homologues to the query sequence.
4. The system of clause 1, wherein the M non-query sequences include the N non-query sequences.
5. The system of clause 1, wherein the query sequence belongs to a first species.
6. The system of clause 5, wherein the first species is human.
7. The system of clause 5, wherein the N non-query sequences are non-human primates.
8. The system of clause 7, wherein the N non-query sequences belong to a first group of species that shares a family with the first species.
9. The system of clause 8, wherein the family is hominids.
10. The system of clause 8, wherein the first group of species shares an order with the first species.
11. The system of clause 10, wherein the order is primates.
12. The system of clause 5, wherein the M non-query sequences belong to a second group of species that shares a class with the first species.
13. The system of clause 12, wherein the class is mammals.
14. The system of clause 12, wherein the second group of species shares a phylum with the first species.
15. The system of clause 14, wherein the phylum is chordates.
16. The system of clause 12, wherein the second group of species shares a kingdom with the first species.
17. The system of clause 16, wherein the kingdom is animals.
18. The system of clause 1, wherein a first average evolutionary distance determined from evolutionary distances between species in a first group of species to which the N non-query sequences belong is less than a second average evolutionary distance determined from evolutionary distances between species in a second group of species to which the M non-query sequences belong.
19. The system of clause 18, wherein non-overlapping species between the first and second groups of species are more evolutionarily distant from the first species than overlapping species between the first and second groups of species.
20. The system of clause 1, wherein the first model is further configured to predict the first identity as a first probability distribution that specifies base-wise likelihoods of the masked base at the target position being adenine (A), cytosine (C), thymine (T), and guanine (G).
21. The system of clause 20, wherein the first probability distribution further specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-).
22. The system of clause 1, wherein the second model is further configured to predict the second identity as a second probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G.
23. The system of clause 22, wherein the second probability distribution further specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, -, and X.
24. The system of clause 23, wherein the evolutionary conservation determination logic is further configured to measure the evolutionary conservation of the masked base at the target position based on the first and second probability distributions.
25. The system of clause 24, wherein the evolutionary conservation determination logic is further configured to use t-test statistics to measure the evolutionary conservation of the masked base at the target position.
26. The system of clause 1, wherein the first model is further configured to encode a background mutation probability estimation in the first identity of the masked base at the target position.
27. The system of clause 1, wherein the second model is further configured to encode a group-specific mutation probability estimation in the second identity of the masked base at the target position.
28. The system of clause 1, wherein the first and second models are neural networks.
29. The system of clause 28, wherein the first and second models are feed-forward neural networks.
30. The system of clause 28, wherein the first and second models are recurrent neural networks.
31. The system of clause 30, wherein the first and second models are long short-term memory (LSTM) networks.
32. The system of clause 28, wherein the first and second models are convolutional neural networks (CNNs).
33. The system of clause 32, wherein the first and second models are Transformer-based CNNs.
34. The system of clause 24, wherein the first and second probability distributions are first and second softmax distributions.
35. The system of clause 1, wherein the first and second models are trained using a categorical cross-entropy loss function.
36. The system of clause 1, wherein the background mutation probability estimation measures random mutation experienced in genes.
37. The system of clause 1, wherein the group-specific mutation probability estimation measures epistasis-driven mutations experienced in genes of a particular group of species.
38. The system of clause 1, wherein the evolutionary conservation is measured as a frequency of occurrence of a particular base at a particular position across the multiple sequence alignment.
39. The system of clause 1, wherein the evolutionary conservation is measured as a frequency of substitution of a particular base at a particular position across the multiple sequence alignment.
40. The system of clause 1, wherein the first and second models are configured to detect evolutionary distances and branch lengths implicitly encoded in the multiple sequence alignment as second-order signals.
41. The system of clause 1, wherein the masked base is flanked by two right bases and two left bases.
42. The system of clause 1, wherein the first and second models are configured to predict alternative splicing at the masked base in response to processing the multiple sequence alignment that aligns sequences of length greater than K.
43. The system of clause 42, wherein K is greater than 5.
44. The system of clause 1, wherein the first and second models are configured to detect conservation-driving genomic features.
45. The system of clause 1, further configured to determine a function or feature of the masked base in dependence upon the evolutionary conservation.
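To make the two-model arrangement recited in the clauses above concrete, the following is a minimal illustrative sketch in Python (PyTorch and SciPy), not the claimed implementation. The class name MaskedBasePredictor, the function conservation_score, the window size, the channel counts, and the paired t-test over base-wise probabilities are illustrative assumptions; the sketch only mirrors, at a toy scale, the elements recited in clauses 1, 20-25, 32, 34, and 35 (masked-base prediction from a multiple sequence alignment, softmax distributions over A, C, T, G, X, and -, comparison of the two distributions, and categorical cross-entropy training).

# Illustrative sketch only; not the claimed implementation. All names and
# shapes below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy import stats

VOCAB = ["A", "C", "T", "G", "X", "-"]  # bases plus unknown (X) and missing (-) symbols


class MaskedBasePredictor(nn.Module):
    """Toy convolutional model (in the spirit of clauses 28 and 32) that predicts
    the identity of the masked base at the center column of a one-hot-encoded
    multiple sequence alignment window."""

    def __init__(self, num_rows: int, window: int = 5):
        super().__init__()
        in_channels = num_rows * len(VOCAB)
        self.conv = nn.Conv1d(in_channels, 64, kernel_size=window, padding=window // 2)
        self.head = nn.Linear(64, len(VOCAB))

    def forward(self, msa_onehot: torch.Tensor) -> torch.Tensor:
        # msa_onehot: (batch, rows, columns, vocab), with the target column masked out
        b, r, c, v = msa_onehot.shape
        x = msa_onehot.permute(0, 1, 3, 2).reshape(b, r * v, c)
        h = F.relu(self.conv(x))
        center = h[:, :, c // 2]                      # features at the masked column
        return F.softmax(self.head(center), dim=-1)   # softmax distribution (clause 34)


def conservation_score(p_narrow: torch.Tensor, p_broad: torch.Tensor) -> float:
    """Compare the distribution from the N-sequence (e.g., primate) model with the
    distribution from the M-sequence (e.g., mammal) model; a paired t-test over the
    base-wise probabilities is used purely as an illustrative statistic (clause 25)."""
    t_stat, _ = stats.ttest_rel(p_narrow.detach().cpu().numpy().ravel(),
                                p_broad.detach().cpu().numpy().ravel())
    return float(t_stat)


if __name__ == "__main__":
    N, M, columns = 10, 100, 5                          # hypothetical alignment depths and width
    first_model = MaskedBasePredictor(num_rows=N + 1)   # query sequence plus N non-query sequences
    second_model = MaskedBasePredictor(num_rows=M + 1)  # query sequence plus M non-query sequences

    msa_small = torch.rand(1, N + 1, columns, len(VOCAB))
    msa_large = torch.rand(1, M + 1, columns, len(VOCAB))

    p1 = first_model(msa_small)    # first identity of the masked base
    p2 = second_model(msa_large)   # second identity of the masked base
    print(conservation_score(p1, p2))

    # Training (clause 35) would minimize a categorical cross-entropy loss, e.g.:
    # loss = F.cross_entropy(logits, true_base_index)  # logits taken before the softmax

In this sketch, a large divergence between the distribution produced from the shallower (N-sequence) alignment and the distribution produced from the deeper (M-sequence) alignment is read as a group-specific conservation signal at the masked position, which is one way the evolutionary conservation determination logic described above could compare the first and second identities.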

What we claim is:
1. A system, comprising: a first model configured to process a first multiple sequence alignment that aligns a query sequence to N non-query sequences, wherein the query sequence includes a masked base at a target position that is flanked by right and left bases, and predict a first identity of the masked base at the target position; a second model configured to process a second multiple sequence alignment that aligns the query sequence to M non-query sequences, where M>N, and predict a second identity of the masked base at the target position; and an evolutionary conservation determination logic configured to measure an evolutionary conservation of the masked base at the target position based on the first and second identities of the masked base.
2. The system of claim 1, wherein the N non-query sequences are N closest homologues to the query sequence.
3. The system of claim 1, wherein the M non-query sequences are M closest homologues to the query sequence.
4. The system of claim 1, wherein the M non-query sequences include the N non-query sequences.
5. The system of claim 1, wherein the query sequence belongs to a first species.
6. The system of claim 5, wherein the first species is human.
7. The system of claim 5, wherein the N non-query sequences are non-human primates.
8. The system of claim 7, wherein the N non-query sequences belong to a first group of species that shares a family with the first species.
9. The system of claim 8, wherein the family is hominids.
10. The system of claim 8, wherein the first group of species shares an order with the first species.
11. The system of claim 10, wherein the order is primates.
12. The system of claim 5, wherein the M non-query sequences belong to a second group of species that shares a class with the first species.
13. The system of claim 12, wherein the class is mammals.
14. The system of claim 12, wherein the second group of species shares a phylum with the first species.
15. The system of claim 14, wherein the phylum is chordates.
16. The system of claim 12, wherein the second group of species shares a kingdom with the first species.
17. The system of claim 16, wherein the kingdom is animals.
18. The system of claim 1, wherein a first average evolutionary distance determined from evolutionary distances between species in a first group of species to which the N non-query sequences belong is less than a second average evolutionary distance determined from evolutionary distances between species in a second group of species to which the M non-query sequences belong.
19. The system of claim 18, wherein non-overlapping species between the first and second groups of species are more evolutionarily distant from the first species than overlapping species between the first and second groups of species.
20. The system of claim 1, wherein the first model is further configured to predict the first identity as a first probability distribution that specifies base-wise likelihoods of the masked base at the target position being adenine (A), cytosine (C), thymine (T), and guanine (G).
21. The system of claim 20, wherein the first probability distribution further specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, an unknown base (X), and a missing base (-).
22. The system of claim 1, wherein the second model is further configured to predict the second identity as a second probability distribution that specifies base-wise likelihoods of the masked base at the target position being A, C, T, G.
23. The system of claim 22, wherein the second probability distribution further specifies base-wise likelihoods of the masked base at the target position being A, C, T, G, -, and X.
24. The system of claim 23, wherein the evolutionary conservation determination logic is further configured to measure the evolutionary conservation of the masked base at the target position based on the first and second probability distributions.
25. The system of claim 24, wherein the evolutionary conservation determination logic is further configured to use t-test statistics to measure the evolutionary conservation of the masked base at the target position.
26. The system of claim 1, wherein the first model is further configured to encode a background mutation probability estimation in the first identity of the masked base at the target position.
27. The system of claim 1, wherein the second model is further configured to encode a group-specific mutation probability estimation in the second identity of the masked base at the target position.